I recently asked a PI how much data space he thought he and his team of researchers might need over, say, the next two years. Based on his current research, he thought perhaps 1-2 GB per project would be about right on average, although he did have one anomalous project whose raw data consists of video records, for which he thought about 50 GB would be ample. He then went on to say that it would be difficult to predict with any confidence how much research data his team would generate over a longer time frame, since that would mean knowing what research they were doing, and that in turn would depend on what money was being put on the table, choices about bid writing, bid success, studentships on offer and so on.
I decided to pursue the ‘anomalous’ project by talking to the project researcher in question. This is what he said:
‘I originally recorded three main types of data: computer logging information, screen capture and webcam video, and mobile camera footage. Of these I generated (per week) about 1MB of logging data, about 26GB of screen capture and webcam footage and about 24GB of mobile footage.
‘I recorded data for three weeks (~150GB) and stored all of the originals for this period in addition to the converted files necessary for coding. I did not, however, store all the original files for the mobile video camera due to their size, approximately reducing them by half.
‘In addition to this I deleted a further nine weeks of data (~ 600-700GB) gathered during the participants’ acclimatisation period. Had the storage been available I would have kept and converted this data also, because there are a number of extremely relevant research questions that could be investigated based on it. This would have given an uncompressed total somewhere in the region of 1TB.’
Quite a difference, then, from the original 50 gigabytes guessed at by the PI.
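As a rough check on the arithmetic, here is a back-of-envelope sketch in Python that uses only the per-week figures quoted by the researcher. The variable and function names are my own, and it counts raw capture only; it makes no attempt to model the converted files or the halving of the mobile footage, so treat it as an illustration rather than an audit.

```python
# Per-week raw data volumes quoted by the researcher, in GB.
WEEKLY_GB = {
    "logging": 0.001,                  # about 1 MB
    "screen_capture_and_webcam": 26.0,
    "mobile_footage": 24.0,
}

PI_ESTIMATE_GB = 50.0       # the PI's guess for the whole project
RECORDED_WEEKS = 3          # weeks of data that were kept
ACCLIMATISATION_WEEKS = 9   # weeks of data that were deleted


def raw_total_gb(weeks: int) -> float:
    """Raw (unconverted) volume for the given number of weeks."""
    return weeks * sum(WEEKLY_GB.values())


kept = raw_total_gb(RECORDED_WEEKS)
everything = raw_total_gb(RECORDED_WEEKS + ACCLIMATISATION_WEEKS)

print(f"Raw data per week:    {sum(WEEKLY_GB.values()):.1f} GB")
print(f"Three recorded weeks: {kept:.0f} GB")
print(f"All twelve weeks:     {everything:.0f} GB")
print(f"Ratio to PI estimate: {everything / PI_ESTIMATE_GB:.0f}x")
```

On these figures, raw capture alone over all twelve weeks comes to roughly twelve times the PI's estimate, and that is before any converted copies are counted.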
This raises a number of questions. The first is about how much faith one can place in answers to questions about data use asked at ‘one remove’, even when no real prediction is involved. When prediction is involved and is based on insufficient information, the quality of the answer is likely to be poor. If a question is motivated by the need to plan ahead, a ‘good’ answer is needed if good plans are to be made. Good data management planning requires good information; the inelegant dictum ‘garbage in, garbage out’ applies. We are currently being asked to answer many questions about future data management needs, or having to ask them ourselves. How, then, do we get good information upon which to base our data management?
Equally important, though, is the question of information proliferation. Currently there are about 130 GB of data on file for the project in question. With good house-keeping (aka, delete-and-be-damned) there might, say, be 50 gigabytes. How much of the 1TB that might be collected by project end would have been kept if the principal driver were to maximize the amount of data available for re-use? Presumably all of it, together with additional contextualizing data to maximize its amenability to re-use. But of course, that’s not the end of it: this data will need to be ‘managed’ during the project in accordance with the data management policies in force, and managed thereafter during the period appropriate to its continued usefulness and research funder policy.
So my question is: ‘To what extent will our data storage needs and concomitant management effort be inflated by the act of formalizing research data management?’
And: ‘Can the research budget afford it?’
Your views, as ever, most welcome.