Research360

Managing data across the institutional research lifecycle

Monthly Archives: February 2012

Research Data Management 101 — Intro & definitions

📥  Training

On Wednesday 15 February, we ran our first workshop/focus group with PhD students from the Doctoral Training Centre for Sustainable Chemical Technologies. This is the first of a series of posts summarising the outcomes of that event.

Overview

We had three aims for this session:

  • To introduce the participants to data management planning and have them start writing their own data management plan (DMP);
  • To better understand their current knowledge so that we can plan future training activities;
  • To get feedback on what DMP template would be appropriate for PGR students.

We ran the session with 10 students in the 2010 cohort, who all started in October 2010 and are currently in the first year of their PhD proper, having completed an MRes in 2011. We also invited the 2009 cohort (in their second PhD year), of whom 3 volunteered.

The session consisted of an introductory presentation, given by Professor Matthew Davidson, followed by a hands-on session during which the students worked through a DMP template with support from myself and Cathy Pink. Our colleagues Kara Jones and Katy Jordan from the library were also present, and made notes on what was discussed.

Data management definitions

Early on in the session, we split the students up into groups of 2–3 and asked them to discuss what they understood by a handful of common data management terms. Here's what they came up with:

Data
There was general consensus (as you might expect from a single-discipline group) that data is information gathered directly by experiment, survey, etc. for the purposes of research. It became clear that with more thought, ‘data’ isn’t a hard-edged concept — processing data can produce new data, metadata is also data and so perhaps are the samples from which experimental data were derived.
Metadata
Metadata was described as data behind the data you want to use that gives context and background details. It was noted that this is distinct from the data itself. Chemistry is relatively rare in having a strong history of using metadata in the context of depositing crystallographic data.
Secure storage
The students immediately identified the two sides of security: both guaranteeing that data is (and remains) accessible to those who create and use it, and that it cannot be accessed without permission. It was generally agreed that your required level of security depends on how sensitive your data is.
Access
The most important aspect was seen as ensuring access for the researchers who created the data. Raw data was perceived as not being of much interest to third parties, but a need to better preserve and share experimental protocols was identified.
Intellectual property
It was generally accepted that, for PGR students, the university owns their data and the intellectual property therein. We're hoping to clarify this with our legal team soon, as Bath is unusual in leaving ownership of "scholarly outputs" with the originators — it would be useful to know whether we define data as a scholarly output now. Good data management practice was identified as one way to create a 'paper trail' to prove ownership of ideas in the event of a patent dispute.

Thoughts

Katy Jordan made an interesting comment in her notes:

"Listening in, it struck me quite forcibly that this session needs academics from the relevant department(s) to lead it.  A good level of familiarity with the field, its processes, the department itself, and the way research is carried out, is required to make the session meaningful for the students."

It's occurred to me (and others) before that although the core skills of research data management are mostly discipline-independent, there is a strong need to provide "discipline-flavoured" training sessions, with relevant examples and expertise to ensure that the participants can relate to the content.

We'll be following up soon with more posts on the later part of the session, particularly a discussion of the DMP templates the students tried. Watch this space!

Object stores

📥  Technology

Although my involvement in Research360 is at the level where technology and people interact, I’m also doing my best to understand how our infrastructure is developing at a much lower level so that I’m in a position to better advise non-technical stakeholders.

Bath University Computing Services (BUCS) are currently in the process of procuring a new file store which works in a very different way to our existing storage systems, and I recently had the opportunity to learn more about it from our Database & Systems Manager, Paul Jordan. Since this is a very new area for me, my apologies to you and him for anything that I’ve got wrong.

Like our existing storage, this will be arranged into tiers, with Tier 1 containing the most expensive storage with the quickest access times, and lower tiers providing slower but cheaper storage. Data will be moved between tiers automatically (and invisibly to users) based on configured policies.
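To make the idea of policy-driven tiering concrete, here is a minimal sketch in Python. It is not how the new file store works internally; the three-tier layout, the idle-time thresholds and the names `StoredFile`, `DEMOTE_AFTER`, `target_tier` and `apply_policy` are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StoredFile:
    name: str
    last_accessed: datetime
    tier: int = 1  # 1 = fastest and most expensive


# Hypothetical policy: how long a file may sit idle on a tier before it
# is moved down to the next one. The lowest tier has no entry.
DEMOTE_AFTER = {1: timedelta(days=30), 2: timedelta(days=180)}
LOWEST_TIER = 3


def target_tier(f: StoredFile, now: datetime) -> int:
    """Return the tier a file belongs on, based only on its idle time."""
    idle = now - f.last_accessed
    tier = 1
    while tier < LOWEST_TIER and idle > DEMOTE_AFTER[tier]:
        tier += 1
    return tier


def apply_policy(files, now):
    """Move each file to its target tier; return a list of (name, old, new)."""
    moved = []
    for f in files:
        new_tier = target_tier(f, now)
        if new_tier != f.tier:
            moved.append((f.name, f.tier, new_tier))
            f.tier = new_tier
    return moved
```

A real system would presumably weigh more than idle time (file size, type, owner and so on), but the shape is the same: a configured rule decides where each file lives, and the user never has to ask.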

Where this new storage differs from our existing systems is that the lowest tier will not be a tape carousel, but an “object store”. Where a traditional file system stores data in an ordered, hierarchical way, an object store stores individual data objects in a flat namespace.
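The difference between a hierarchy and a flat namespace can be sketched in a few lines of Python. This is a toy in-memory model, not the design of any particular product: the class name `FlatObjectStore` is hypothetical, and deriving the object ID from a SHA-256 hash of the content is just one convenient assumption (a real store may assign IDs quite differently).

```python
import hashlib
from typing import Optional


class FlatObjectStore:
    """Toy object store: every object lives in one flat dictionary."""

    def __init__(self):
        self._objects = {}  # object ID -> (data, metadata)

    def put(self, data: bytes, metadata: Optional[dict] = None) -> str:
        # Assumption for illustration: the ID is the SHA-256 of the content.
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = (data, dict(metadata or {}))
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id][0]

    def metadata(self, object_id: str) -> dict:
        return self._objects[object_id][1]


store = FlatObjectStore()
oid = store.put(b"wavelength,absorbance", {"path": "/lab/run1/spectrum.csv"})
print(oid[:12], store.metadata(oid)["path"])
```

Note that the familiar folder path survives only as a piece of metadata attached to the object; the store itself has no directories to walk.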

The major advantage of this is that much more of the available space on the physical disks can be used to store actual user data: the overhead is much lower than for traditional filesystems. By virtualising storage across a network in a new way, it’s also very much more scalable than anything we currently use; we could easily grow this to the petabyte level or expand out into the cloud if need be.

Now, most users need never know that their data is stored in an object store, just like they don’t need to know whether the disks were made by Hitachi or Western Digital. An extra layer on top does some translation, allowing you to store files over the network just like any other network-attached storage (NAS). Users can access it via a mapped drive in Windows or an NFS mount.

However, the object store is also accessible directly via a RESTful API over HTTP/HTTPS (in fact, that’s how the NAS layer interacts with it too). Despite being sold as a replacement for tape archival, it’s very quick to access over the network, and authentication of users via LDAP or Active Directory is also built in. In addition to this, an object store can perform other clever functions during or after ingestion, such as transforming data into other formats or making use of metadata.
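As a rough illustration of what talking to an object store over HTTP might look like, the Python sketch below builds (but does not send) a PUT and a GET request using only the standard library. The host name, the `/objects/<key>` URL layout and the helper names `object_url`, `build_put` and `build_get` are all hypothetical; the real API of any given product will differ and should be taken from its documentation. Authentication is left out entirely.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://objectstore.example.ac.uk"


def object_url(key: str) -> str:
    """Map an object key onto a URL, percent-encoding every reserved character."""
    return f"{BASE_URL}/objects/{quote(key, safe='')}"


def build_put(key: str, data: bytes,
              content_type: str = "application/octet-stream") -> Request:
    """Build a request that would store `data` under `key`."""
    return Request(object_url(key), data=data, method="PUT",
                   headers={"Content-Type": content_type})


def build_get(key: str) -> Request:
    """Build a request that would fetch the object stored under `key`."""
    return Request(object_url(key), method="GET")


req = build_put("run 1/spectrum.csv", b"a,b", "text/csv")
print(req.get_method(), req.full_url)
```

Passing either request to `urllib.request.urlopen` would actually send it. The point is only that storing and fetching an object come down to ordinary HTTP verbs on a URL, which is what makes it plausible for a repository to target the store directly.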

It therefore seems like the perfect back-end to a digital repository such as EPrints, DSpace or Fedora. A load of overhead could be cut down by having the repository target the object store directly, rather than doing so via files on a virtual file system using the NAS layer.

Alternatively, if the object store itself is clever enough, it could be used directly as a repository, using only a very thin user interface on top. A SWORD2-compliant interface would open up even more options.

If you’re interested in learning more, there are a number of white papers and other resources available on the Hitachi Content Platform web page.

Are other institutions implementing similar types of storage? Is it possible to integrate a repository with an object store directly via HTTP and if so has it been done?

It would be interesting to hear from anyone else who’s come across anything similar.

Image credit: Kitchen Shelves by John Martinez Pavliga

South-west meetup

📥  General

On Wednesday 1 February, we met up with representatives of three other universities in the south-west area to discuss and find common ground on our JISC Managing Research Data projects. Represented were Bristol (data.bris), UWE and Exeter (OpenExeter).

Each institution has its own unique set of requirements, and one of the first things we discovered was how well our projects complemented each other. Research360 is focusing on Science and Engineering, data.bris on Arts and Humanities and UWE’s project on Health and Life Sciences; OpenExeter is further down the road than the rest of us, and is rolling out data management across the University of Exeter.

As well as these differences, we also picked out many areas of commonality in which we can work together.

Training

We identified some potential for linking up for shared train-the-trainer events to help our support staff to get up to speed. The data management agenda implies new skills to be acquired right across our institutions, from researchers and research students through to IT supporters and librarians.

Engagement and advocacy

It was noted that “advocacy” shades into “training” quite subtly — especially as many people feel that a need for training implies that they can’t do their job properly. There’s particularly a need to minimise the need for training by integrating data management processes with the research workflow as transparently as possible.

Bristol have a champion in the Faculty of Arts office who is very good at spotting and passing on bids and other queries which relate to data management — this sounds like a useful approach.

There are differing opinions (in the sector generally) about who should have responsibility for data management advice and support. In the long term new staff will need to be recruited, but in the short term it’s about up-skilling existing staff appropriately. The danger here is that rising demand for support may outstrip supply, and we'll all be working hard to manage expectations and ensure this doesn't happen.

Repositories

We all have an institutional repository of some sort, mostly for publications, and are keen to develop digital repositories for data too. Both UWE and Bath have EPrints-based repositories and are evaluating whether EPrints will be suitable for data as well.

Research information management

As well as developing repositories for data, Bath, Bristol and Exeter are currently implementing Current Research Information Systems to aid centralised monitoring of research outcomes (especially important for REF2014). Bath and Bristol are using Pure, with Exeter already having established Symplectic — we’re all interested in ways to incorporate information about research data into these systems.

Policy

Discussing policy development is tricky, as it can directly affect competitiveness. Nonetheless, it’s clear that some collaboration can be profitable so we’ll be looking at ways that we can do this appropriately.

We’re also all planning to send representatives to the upcoming policy workshop in Leeds.

Requirements gathering

We’re all making use of various structured tools, such as DAF and CARDIO, so will be able to share information about how well these tools work for us, along with the general impressions about the results they bring us.

Conclusion

We all went away from the meeting with a lot to think about and a few interesting ideas, so stay tuned for more there. In addition to this post, there are also blog posts from Exeter and UWE for you to take a look at.

Many thanks to everyone who made the meeting worthwhile by contributing, and to Exeter for agreeing to host another in a few months' time.

Progress update: February 2012

📥  Progress updates

On Thursday 2 February 2012, we got together for a project team meeting, so leading on from that, here’s a brief progress report:

  • As we’re starting to get a number of requests for help specifically related to research data management, we agreed to set up a queue on RT (Request Tracker, our support ticket tracking system) to deal with these; this will give users a single point of contact and allow us to measure the volume and type of incoming queries and our capacity to deal with them;
  • The university’s data management web page has been updated and tweaked to provide a more useful experience until we have fully redeveloped that area of the website;
  • Project start up (Work Package 1) is now complete;
  • We are currently recruiting participants for a CARDIO survey to assess perceptions of our current data management infrastructure and capacity; it’s likely we’ll be following this up by interviewing selected respondents; (WP2 Requirements analysis)
  • Work has begun on our Roadmap for submission to EPSRC; (WP3.1 Implementation plan/roadmap)
  • Neil Beagrie is making good progress with interviewing stakeholders for the business case; (WP3.2 Sustainability and business model)
  • We will be running a hybrid training session and focus group on data management planning (DMP) with second-year DTC students in mid-February; we are also planning how we can get feedback on DMPonline from students in the Centre for Digital Entertainment, given that they are based out in industry with their placement partners; (WP3.3 Data management planning/WP6 Liaison, training & advocacy)
  • A rough draft of our high-level data management policy has been produced and is now in the process of being refined; Cathy Pink will be attending the forthcoming workshop on policy development; (WP4 Policy development)
  • We are getting closer to appointing the systems developer who will help us develop and pilot an interface between our VRE, iSusLab, and a pilot repository via SWORD2; in the meantime, we will be making progress by planning how we will pilot Electronic Lab Notebooks (ELNs); (WP5.2 Research workflow & data deposit)
  • Work has begun on our data storage guidelines; (WP5.3 Data storage guidelines)
  • After a very valuable meeting in January, at which we learned more about CERIF and Pure (the Current Research Information System currently being implemented at Bath), we’ve clarified Deliverable 5.4 — more on this to follow.

VALA2012: libraries and technology down under

📥  Events

Liz Lyon gave a keynote speech on Wednesday 8 February 2012, entitled "The Informatics Transform: Re-engineering Libraries for the Data Decade", at the VALA2012 conference in Melbourne, Australia. The talk focused on the transformations required for libraries to keep up with digital trends, and drew on Liz's own experience for exemplars, including the University of Bath and the Research360 project.

VALA - Libraries, Technology and the Future Inc. (VALA) is "an Australian not-for-profit professional organisation that promotes the use and understanding of information and communication technologies across the galleries, libraries, archives and museum sectors." (via VALA on Wikipedia)