Scoping Data Management: inflation, inflation!

I recently asked a PI how much data space he thought he and his team of researchers might need over, say, the next two years. Based on his current research, he thought that perhaps 1-2 GB per project would be about right on average, although he did have one anomalous project whose raw data consists of video records, for which he thought about 50 GB would be ample. He then went on to say that predicting with any confidence how much research data would be generated by his research team over a longer time frame would be difficult, since that would mean knowing what research they would be doing, and that in turn would depend on what money was being put on the table, choices about bid writing, bid success, studentships on offer and so on.

I decided to pursue the ‘anomalous’ project by talking to the project researcher in question. This is what he said:

‘I originally recorded three main types of data: computer logging information, screen capture and webcam video, and mobile camera footage. Of these I generated (per week) about 1MB of logging data, about 26GB of screen capture and webcam footage and about 24GB of mobile footage.

‘I recorded data for three weeks (~150GB) and stored all of the originals for this period in addition to the converted files necessary for coding. I did not, however, store all the original files for the mobile video camera due to their size, approximately reducing them by half.

‘In addition to this I deleted a further nine weeks of data (~ 600-700GB) gathered during the participants’ acclimatisation period. Had the storage been available I would have kept and converted this data also, because there are a number of extremely relevant research questions that could be investigated based on it. This would have given an uncompressed total somewhere in the region of 1TB.’

Quite a difference, then, from the original 50 gigabytes guessed at by the PI.
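The arithmetic behind the researcher's figures can be sketched in a few lines of Python. This is only a rough illustration: the weekly rates are those quoted above, but the assumption that the nine-week acclimatisation period was captured at the same weekly rate is mine, as are all the names used. On that assumption the raw totals come to roughly 150 GB for the three study weeks and 600 GB for all twelve weeks. That is the lower end of the researcher's 600-700GB estimate for the nine deleted weeks alone, and well short of his ~1TB total, the make-up of which he did not break down.

```python
# Weekly raw volumes as quoted by the researcher, in GB (taking 1 MB = 0.001 GB).
WEEKLY_GB = {
    "logging": 0.001,
    "screen_and_webcam": 26.0,
    "mobile_footage": 24.0,
}

# The PI's original guess for the whole project, in GB.
PI_ESTIMATE_GB = 50.0


def raw_volume_gb(weeks: int) -> float:
    """Raw (unconverted) volume for a number of recording weeks.

    Assumes a constant weekly capture rate, which is an assumption
    made for this sketch rather than something the researcher stated.
    """
    return weeks * sum(WEEKLY_GB.values())


study_gb = raw_volume_gb(3)      # the three weeks actually kept
full_gb = raw_volume_gb(3 + 9)   # including the nine deleted weeks

print(f"Three study weeks: {study_gb:.0f} GB")
print(f"All twelve weeks:  {full_gb:.0f} GB")
print(f"Ratio to PI estimate: {full_gb / PI_ESTIMATE_GB:.0f}x")
```

Even on this conservative reckoning, the raw data alone is an order of magnitude larger than the PI's estimate.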

This raises a number of questions. The first is about the confidence with which one can ask questions at ‘one remove’ about data use and have faith in the answers, even when there is no real prediction involved. When prediction is involved and is based on insufficient information, the quality of the answer is likely to be poor. If a question is motivated by the need to plan ahead, a ‘good’ answer is needed if good plans are to be made. Good data management planning requires good information: the inelegant dictum ‘garbage in, garbage out’ applies. We are currently being asked to answer many questions, or having to ask them ourselves, about future data management needs. How, then, do we get good information upon which to base our data management?

Equally important, though, is the question of information proliferation. Currently there are about 130 GB of data on file for the project in question. With good house-keeping (aka, delete-and-be-damned) there might, say, be 50 gigabytes. How much of the 1TB that might be collected by project end would have been kept if the principal driver were to maximize the amount of data available for re-use? Presumably all of it, together with additional contextualizing data to maximize its amenability to re-use. But of course, that’s not the end of it: this data will need to be ‘managed’ during the project in accordance with the data management policies in force, and managed thereafter during the period appropriate to its continued usefulness and research funder policy.

So my question is: ‘to what extent will our data storage needs and concomitant management effort be inflated by the act of formalizing research data management?’.

And: ‘Can the research budget afford it?’.

Your views, as ever, most welcome.

Posted in Uncategorized.


Extending the RDM Benefits Envelope

An opportunity was extended at the MRD Programme Phase 2 launch workshop to brainstorm the benefits and evidence thereof of research data management for, in particular, the institution-level projects. Since REDm-MED is not one of these I found myself assisting my colleagues on the University of Bath’s other MRD project: Research360.

Neil Beagrie provided ‘seed corn’ for this event, by way of a summary of benefits collated from those identified by the RDM infrastructure projects; from these we developed a number of ideas which, no doubt, R360 will report in due course.

One idea, however, is of particular interest not only to the Research360 project but to any project which represents research which commonly has external collaborators – such as does the engineering research that REDm-MED hopes to support.

It occurred to us that one class of beneficiary had been overlooked, this being the Industry Collaborator. Although ‘external’ with respect to the institution, it is clear that some of the benefits of research data management that occur within academia will spill over into industry through the process of collaboration. Likewise, the research data management tools, methods and practice developed for use in academia will be directly adoptable within industry itself. Clearly, since funders are always interested in industry collaboration, they will be pleased to see such benefits identified. Here then, under the heading of Benefits for Industry, are our first thoughts.

  • Greater confidence in sharing data through better, more case-appropriate and more transparent security measures.
  • Better access to new data or data hitherto undiscoverable.
  • Reduction in data loss as a result of improvements in management throughout the data life cycle.
  • A better understanding of good RDM practice.

All these (together with other benefits already identified through the MRD Programme activities) will lead to the high-level benefit of a cultural change in RDM practice.

Suggestions for additions to this list would be welcome.



Discrimination

Our, or at least my, current – immature – thinking on research data management tends toward a ‘one-size-fits-all’ approach, irrespective of the circumstances of the data management case.

It is quite clear that there are great inter-discipline and, within disciplines, inter-project differences in the character of the data and data sets encountered. It seems reasonable to suspect that data of different characters might have different management demands. At the same time, the motivation for managing a data set, and the anticipated use to which that data set might be put, will differ case-by-case. It is reasonable to assume that the data management demands for, say, audit will be different from those for, say, re-use for a known future purpose. Similarly, the time frame over which data is intended to be managed for re-use will differ; and it might be reasonable to assume that the greater the time between data generation and expected re-use, the greater the resources it will be necessary to expend to ensure re-usability. There will be no point in applying the same resources to data that will be re-used a week-next-Tuesday as to data whose re-use is anticipated years or even decades hence.

Taken together, these three dimensions (to which others might no doubt easily be added) suggest that discrimination between data sets will be necessary to achieve efficient and effective data management. If discrimination is not made, then a ‘one-size-fits-all’ approach really will have to be taken, with the concomitant waste of resources that will inevitably result.

So: ‘How do we achieve this discrimination case-by-case?’ Then, having discriminated: ‘How do we respond to optimize data management for the case?’

If these questions haven’t been asked before, they have now. If they have been asked, would someone please direct me to some of the answers?



REDm-MED at Christmas

The REDm-MED Project approaches the Christmas break in good order, I think, with progress having been made in the three main tasks: developing the requirements specification for the DMP for the University of Bath’s Dept. of Mech. Engineering, getting the development of the RAIDmap tool underway, and the various ‘outreach’ and project management activities.

We have been using the CARDIO Tool, developed by our DCC colleagues, to help characterize the current strengths and weaknesses of the Department’s (and, by association, the University’s) data management support. To do this we have ‘volunteered’ a panel of – currently – seven respondents. We have nearly completed Stage 2 of the process, which invites the respondents to complete a triumvirate of questionnaires eliciting their views on the operational, technical and resource aspects of current support. The CARDIO tool is impressive, although from time to time we do feel like Beta Testers! We would encourage the use of CARDIO not only because of its very evident usefulness, but because use and feedback to the developers will help them in their refinement and make it better still. I should mention especially the help of Brian Aitken in fielding and sorting out the small hiccoughs we have encountered in using CARDIO for the first time.

For development of our RAIDmap (Research Activity Information Development) data association tool, we have chosen a ‘cut-down’ Agile Software Development approach, using JavaScript; ‘cut-down’ because our development ‘team’ is small (though perfectly formed!), consisting solely of Uday Thangarajah, and therefore we don’t need the full panoply of Agile measures. Although guided by a high-level functional specification, in the spirit of agility we are initially specifying minimal functionality, which will be increased iteratively with input from the ‘stakeholders’ as we proceed. Where possible we are using existing – preferably open-source – application elements to provide core functionality.

As for Project Management and outreach, well, at the beginning of the month I had the opportunity to present the project at the MRD Programme launch meeting and to discuss it and other projects with some familiar and new faces. Alex Ball represented the project at the 7th International Digital Curation Conference in Bristol.

Slides and poster are available on the REDm-MED web site.

We look forward to the challenges of 2012!



Keeping the REDm-MED Project on Track

Our experience of observing the varied success of introducing knowledge and information management tools and methods in industry has brought us to the following conclusion about what is needed if success is to be hoped for:

‘Interventions should result in a zero net resource requirement increase.’

This does not mean we think that no money or effort should be spent on performing, for example, data management. Rather, it means that the total effort spent by those working with data – principally researchers – should not increase when new data management practices are introduced. For this to happen, either the extra effort spent on data management must save researchers at least that much effort elsewhere in their work, or the burden of data management must fall elsewhere.

Some of the previous projects we’ve been involved with produced guidance which, on the basis of ‘do as you would be done by’, might inform the REDm-MED Project.

From Principles for Through Life Management of Engineering Information:

Principle 1 — the Principle of Parsimony: Create, record and retain information only as necessary.

From Principles for Managing Engineering Research Data:

Principle 12. The tools put in place to assist in the satisfaction of requirements specified in data management plans should be simple, engaging and easy to access. (Principle inspired by the JISC-funded Incremental Project.)

From Jones, K, 2011. Assessing Institutional Data Storage and Management using the Data Asset Framework (DAF) Methodology at the University of Bath. Opus 24960

Recommendation 6. Archiving of data will be automatic, simple and easy.

From this we have synthesized an aspiration and rubric for the REDm-MED Project:

We will aim to provide guidance and tools to aid practical RDM planning which are simple and engaging to use, easy to access and which require least effort on the part of the users.

Wish us luck!
