This post emerged from discussions at the JISC MRD Hack Days, particularly with Joss Winn of the University of Lincoln’s Centre for Educational Research and Development. The event brought together developers and data management experts for two intensive days to discuss and prototype tools for research data management.
Joss has also written a more discursive post about our discussions of file synchronisation, particularly with respect to handling of large files.
For a bit of context, both Joss and I make regular use of Dropbox and Git where appropriate.
The problem
Many researchers store the majority of their live data on local disks, with little or no redundancy, leaving them open to data loss through accident or theft. To solve this problem, we provide research users with high resilience, high performance, high capacity network storage, but in spite of these advantages, they often don’t use it as well as they might.
Another requirement is for easy sharing of files. Most data sharing still takes place via email.
The main reason for this is that many researchers do a lot of work on their laptops, in locations where their network connection may be intermittent, slow or completely sent. On the train, on a plane, in a cafe, they need access to some or all of their data wherever they are.
When faced with this problem, many researchers turn to Dropbox because it is easy to use and requires no user interaction beyond the initial setup. However, there are serious issues with using Dropbox to store research data, primarily the fact that confidential data is being stored on servers outside the institution’s control.
What is needed is a tool to transparently synchronise local and network storage, effectively providing an offline cache which provides the convenience and speed of local disk access combined with the resilience of network attached storage.
Desired features
Control over storage locations
For confidential research data, it is highly desirable for all copies to remain under the control of the institution(s) who are responsible for looking after it. Any solution should at least have the option of storing data on a university-run storage service.
Encrypted network transfer
Control of the storage locations on their own is not sufficient. If data is sent over the internet with weak (or worse non-existent) encryption, it can easily be intercepted by an attacker. Strong encryption should be used to protect all data.
Large file support
Many of the files which researchers routinely work with may be tens or hundreds of megabytes, or in some cases gigabytes or terabytes. Clearly, there’s a limit to this – it’s reasonable to expect that researchers will have to manage files of a gigabyte or more differently. But a suitable solution should at least work well for files of tens or hundreds of megabytes.
Two important factors spring to mind here. First, the user needs feedback about the progress of a sync so that they aren’t surprised when changes they were expecting haven’t propagated yet. Second, the tool needs to gracefully handle a user cancellation or a dropped connection without losing or corrupting data. Ideally, if this happens it should be able to resume where it left off.
Conflict resolution
Once you have two copies of your files, you have two different places to modify them, giving you the possibility of making different changes to the same file prior to synchronising. This becomes even more likely when you are sharing the same files between multiple users.
Metadata
Storing metadata along with data for later (perhaps automated) deposit in a repository is a core research data management practice. It tends to be readily available only when the data is created, but often only useful when data is finally published or archived. With a dedicated “Academic Dropbox”, it may be possible for users to associate metadata directly files at creation time, and then keep that metadata with the file throughout its life through to deposit in an archive.
Existing solutions that might work
Here is a whistlestop tour of some of the options we dug up. I’ve listed the pros and cons as I see them (though feel free to correct/update me in the comments), and I’ve not commented on features for which I couldn’t find enough information to judge.
Unison
http://www.cis.upenn.edu/~bcpierce/unison/
I (Jez) use this daily.
Pros
- Excellent handling of conflicts (choose which copy to use, or merge the two where possible)
- Allows synchronisation of arbitrary pairs of folders, so will work with any storage that can be mounted by the user
- Gracefully handles interruptions of the transfer, and restarts next time from where it left off
- Uses the rsync protocol to transfer only the parts of files which have changed
- Can transfer files over an encrypted SSH connection
Cons
- Requires initial configuration by the user: in Mac or Linux this can only be done by editing configuration files, though the Windows client has a
- Needs to be explicitly run by the user, which is easy to forget; could be run on a schedule, but this typically requires
- For large sets of files, scanning for changes can take a long time, particularly over the network
Rsync
Pros
- Allows synchronisation of arbitrary pairs of folders, so will work with any storage that can be mounted by the user
- Very flexible, and operates without user interaction, so can be adapted to fit many situations
- Uses the rsync protocol to transfer only the parts of files which have changed
- Can transfer files over an encrypted SSH connection
Cons
- Syncs in one direction only, so full synchronisation requires two runs, one in each direction
- Command line tool has many complex options, and the available GUIs only go a small way to improve this, so it can be difficult for non-technical users to understand
Git and other distributed version control systems (DVCS)
Both Joss and I (Jez) use this regularly.
Pros
- Can transfer files over an encrypted SSH or HTTPS connection
- Outstanding conflict resolution by intelligent merging of files
- Support for common software development activities such as branching (e.g. to make experimental changes)
Cons
- All actions require manual running of commands, either via a command line or a GUI, so requires quite a major change to the user’s workflow
- Merging only works well for text-based file formats, though it is possible with some work to use alternative merge tools for, say, Word documents
- Poor handling of large binary files generally, although extensions are available to mitigate this (see below)
Large file extensisons to DVCS
E.g. git-bigfiles http://caca.zoy.org/wiki/git-bigfiles, git-media https://github.com/schacon/git-media, git-annex http://git-annex.branchable.com/, mercurial large files extension http://mercurial.selenic.com/wiki/LargefilesExtension, and also Boar https://code.google.com/p/boar/
Boar is a VCS designed specifically to work with large files, while the others are extensions to existing VCS systems.
Pros
- Similar to DVCS (above), but vastly improved handling of large binary files (reduced memory requirement, for example)
Cons
- Similar to DVCS, but requires additional configuration
SparkleShare
Pros
- Very little configuration required to achieve similar results to Dropbox
- Can store data anywhere a Git repository can be placed but there is potential to build alternative storage backends
- Git features not exposed by the SparkleShare interface can be accessed using other git-based tools
Cons
- Software is very new and seems unstable (it crashed a few times for me under Mac)
Sharebox
https://github.com/chmduquesne/sharebox-fs
Pros
- Implemented as a filesystem, so completely transparent to the user once installed
- Uses git as a backend, so shares many of its advantages, including the ability to transfer data in encrypted forms
Cons
- Still very early in development so difficult to get working and only available on Linux
Oxygen Cloud
Pros
- Commercial offering with enterprise support available
- End-to-end strong encryption to protect confidential data
- Option to use your own (institutional) storage instead of the provided cloud storage
- Access via iOS and Android smartphones
Cons
- Enterprise service with commercial pricing
Summary
There’s no simple solution to this, but we now have a whole range of things to try and to suggest that our users try. Who knows, some of them might even work!

SpiderOak would provide a nice strong solution (as close to the lovely ease-of-use DropBox experience as possible), whilst improving upon the perceived security flaws of DropBox.
Although, I’d question how relevant these ‘flaws’ are, *realistically* to most academic research data. If one was doing medical research with patient data, then yes – DropBox is not appropriate. BUT I perceive such sensitive data types to be relatively rare in academia. My dinosaur evolution data does not need such extreme data security precaution. DropBox is perfectly safe and appropriate (as it is the most frictionless, easy to use solution, near fully-featured service) for most research data IMHO.
Thanks for that Ross, SpiderOak is another good option (although there’s always the question of whether you should trust someone else to encrypt your data for you).
It’s worth noting that personally sensitive information (in the sense of the Data Protection Act) is very common across the social sciences as well as in medical research, and is only one of a number of reasons why you may need to consider the security of data.
In many science and most engineering disciplines, it’s common for research to involve industrial partners who may have a right to first exploitation of the results, or may be providing confidential information about their own business. One of the University of Bath’s particular strengths is its relationship with industry and our partners have to be able to trust that we can protect their investment or they won’t continue to invest.
Even when no industrial partner is involved, the researchers and the university may want to apply for a patent on some innovation, which will not be granted if the knowledge has already made it into the public domain, even accidentally. The RCUK recognises this in its Common Principles for Data:
I personally feel that only the research has enough context about their own data to make an appropriate decision on this, and I hope we can support this by providing them with sufficient information about all of the available tools so they can make an educated choice.