For the last few years, we have been working on the development of new Drosophila flight simulators. Now, finally, we are reaching the stage where we are starting to think about how to store the data we’ll be capturing, both with Open Science in mind and, in particular, with the awareness that this will likely be the last major overhaul of this kind of data before I retire in 20 years. The plan is to have about 3-5 such machines here and potentially others in other labs, if these machines remain as ‘popular’ as they have been over the last almost 60 years. So I really want to get it right this time (if there is such a thing as ‘right’ in this question).
Such an experiment essentially captures time series with around 70-120k data points per session, with about 3-6 variables stored, i.e., a total of at most ~500-800k table cells per session, each with 8-12 bit resolution. There will be at most about 8-16 such sessions per day per machine, so we’re really talking small/tiny data here.
Historically (i.e., from the early 1990s on), these data were saved in a custom, compressed format (they needed to fit on floppy disks) with separate meta-data and data files. We have kept this concept of separating meta-data from data in other, more modern set-ups as well, such as our Buridan experiments, where we use XML for the meta-data files (example data). One of our experiments instead uses data files in which the meta-data are contained as a header at the beginning of the file, with the actual time-series data below (example data). That of course makes the data easy to understand and ensures the meta-data are never separated from the raw data, i.e., there is less potential for mistakes. In another, newer experiment we are following some of the standards from the Data Documentation Initiative (no example data, yet).
With all of these different approaches over the last two decades, I thought I ought to get myself up to date on what are surely, by now, generally agreed-upon conventions for data structure, meta-data vocabularies, naming conventions, etc. I started looking around and got the impression that the different approaches we have used over time are all still in use, plus some new ones, of course. I then asked on Twitter, and the varying responses confirmed my impression that there isn’t really a “best-practice” kind of rule.
Given that there was quite a lively discussion on Twitter, I’m hoping to continue this discussion here, with maybe an outcome that can serve as an example use case someday.
What do we want to use these data for?
Each recording session will be one animal experiment with different phases (“periods”), for instance some “training” periods and some “test” periods, with experimental conditions differing between training and test. The data will be saved continuously as a time series throughout the experiment, so the minimal data would be a timestamp, the behavior of the animal, and a variable (stimulus) that the animal is controlling with its behavior. Thus, in the simplest case, three columns of integers.
The meta-data for each experiment have to contain a description of the columns, of course, as well as the date and time at the start of the experiment, the genotype of the animal, a text description of the experiment, the DOI of the code used to generate the data, the sequence and duration of periods, temperature, and other variables to be recorded or set on a per-session or per-period level.
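Just to make this concrete, here is a rough sketch of what one session’s files could look like (all file names, keys and values below are placeholders, not a proposed standard):

```python
import csv
import json
from datetime import datetime, timezone

# All keys and values here are placeholders, not a proposed standard.
metadata = {
    "start_time": datetime.now(timezone.utc).isoformat(),
    "genotype": "wild-type (placeholder)",
    "description": "example session, training/test periods",
    "code_doi": "10.xxxx/placeholder",        # DOI of the acquisition code
    "temperature_celsius": 25,
    "periods": [
        {"name": "training", "duration_s": 240},
        {"name": "test", "duration_s": 120},
    ],
    "columns": [
        {"name": "timestamp_ms", "type": "int", "unit": "ms"},
        {"name": "behavior", "type": "int", "unit": "a.u."},
        {"name": "stimulus", "type": "int", "unit": "a.u."},
    ],
}

with open("session_0001.meta.json", "w") as fh:
    json.dump(metadata, fh, indent=2)

# The time series itself stays a plain three-column table of integers.
with open("session_0001.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow([c["name"] for c in metadata["columns"]])
    writer.writerow([0, 512, 1023])  # a single example sample
```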
A dataset or small project can consist of maybe three to four groups of experiments, let’s say one experimental genotype and two control groups. Traditionally, the way we have handled this grouping in most of our experiments is to keep a text file in which the experimenter lists which file belongs to which group. That way, anybody can read the text file and get an understanding of the experimental design. The file also contains comments and notes about observations the user made during the experiment, as well as a text description of the project. In a way, this text file is like a meta-data file for the data-set, rather than for an individual experiment, and thus should probably also contain some minimal mark-up. This text file is then read by either custom software or an R script to compile summary data for each group, e.g., means and standard errors of some variables we extract on a per-period basis, which are then plotted and compared between groups. As there are numerous ways to evaluate an animal’s behavior once we have the full time series, there is any number of different parameters one might want to extract from the data and plot/compare.
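For illustration, reading such a grouping file and compiling per-group summaries could look roughly like this (the file format, the column index and the toy “score” are purely hypothetical):

```python
import math
import statistics
from collections import defaultdict

# Hypothetical grouping file: one "<data file> <group>" pair per line;
# lines starting with "#" hold the free-text notes and project description.
groups = defaultdict(list)
with open("groups.txt") as fh:
    for line in fh:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        filename, group = line.split()
        groups[group].append(filename)

def score(filename):
    """Toy per-session score: mean of the third (stimulus) column."""
    with open(filename) as fh:
        next(fh)  # skip the header row
        values = [int(row.split(",")[2]) for row in fh if row.strip()]
    return statistics.mean(values)

for group, files in sorted(groups.items()):
    scores = [score(f) for f in files]
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores)) if len(scores) > 1 else 0.0
    print(f"{group}: mean={mean:.2f}, SEM={sem:.2f}, n={len(scores)}")
```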
This is where the open science part would come in. Whenever the user runs the script that evaluates, plots and compares the data, the entire dataset is automatically made publicly accessible. Along with the dataset (raw data, meta-data and grouping text file), all the evaluations should also be deposited. Currently, we do this as a PDF file, but that is all but useless for anything other than human reading. Ideally, I’d like this evaluation file to contain all the content of the grouping text file, as well as the DOI of the script that generated it and (semantic?) markup that structures the evaluation document. Such an evaluation document would be readable both by machines and by humans (with a reader, which is why we started out with the PDF format) and provide an overview of exactly what was done to what data.
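One possible shape for such a machine-readable evaluation document, just as a sketch (all field names are placeholders, and the results would be filled in by the analysis script):

```python
import json
from datetime import datetime, timezone

# All field names below are placeholders, just to show the shape.
evaluation = {
    "generated": datetime.now(timezone.utc).isoformat(),
    "script_doi": "10.xxxx/placeholder",       # DOI of the evaluation script
    "grouping_file": "groups.txt",
    "groups": {"experimental": ["fly001.csv"], "control": ["fly002.csv"]},
    "notes": "free-text observations copied from the grouping file",
    "evaluations": [
        {
            "id": "eval-001",                  # what a manuscript figure could link to
            "variable": "stimulus preference",
            "statistic": "mean and SEM per group, per period",
            "results": {},                     # filled in by the analysis script
        }
    ],
}

with open("evaluation.json", "w") as fh:
    json.dump(evaluation, fh, indent=2)
```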
One eventual goal is to also use these evaluation documents during manuscript authoring. Instead of copying the figures, pasting them into a manuscript and then trying to describe the statistics, I’d like to just link the different evaluations from inside the manuscript. Each figure in a manuscript would then just be a link to one of the evaluations in the evaluation document, the one I want readers to see so they can follow my line of argument. Any reader who wants to see other aspects of the data has single-click access to the entire evaluation document, with all our evaluations for this data-set, as well as access to all the code used to generate and evaluate the data, if they so wish. For this, all the data and meta-data in each dataset have to be linked to each other, as well as to the code and the text. Of course, all the data in a manuscript should also be linked together, even though they likely come from different datasets/projects.
With the data and code solutions we’re currently developing, this should allow us to just write code, collect data and link both into our manuscripts. Everything else (data management, DOI assignment, data deposition, etc.) would be completely automatic. Starting at the undergraduate student level, users would simply have to follow one protocol for their experiments and have all their lab-notebooks essentially written and published for them – they’d have a collection of these evaluation documents, ready to either be used by their supervisor, or to be linked in a thesis or manuscript.
So, what would be the best data structure and meta-data format with these goals in mind?
Let me just echo the recommendations to do things in a lightweight format like JSON – it’s convenient, readable by humans to some extent, and parsers exist for many scripting languages.
Thanks! So these look ok to you, then?
https://wf4ever.github.io/ro/
https://www.ddialliance.org/Specification/RDF/Discovery
The basic guidelines I like to give are
1) Choose text vs. binary
Do you absolutely need binary storage, either because of the size of the data or because of heavy I/O? If not, stick to text formats, which are easier to work with. If you do need binary, consider HDF5, which is a portable and well-supported format. HDF5 does add a serious software dependency, so the question of long-term support is relevant. I used to say that if it’s OK for NASA, it’s OK for me, but in view of recent events even NASA’s long-term survival is becoming questionable.
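As a rough sketch of what an HDF5 session file could look like for data of this kind (using the h5py package; all dataset and attribute names are just placeholders):

```python
import numpy as np
import h5py  # third-party package

# Dataset and attribute names are just placeholders.
n = 100_000
timestamps = np.arange(n, dtype=np.uint32)     # ms since session start
behavior = np.zeros(n, dtype=np.int16)         # behaviour channel
stimulus = np.zeros(n, dtype=np.int16)         # stimulus position

with h5py.File("session_0001.h5", "w") as f:
    grp = f.create_group("timeseries")
    grp.create_dataset("timestamp_ms", data=timestamps, compression="gzip")
    grp.create_dataset("behavior", data=behavior, compression="gzip")
    grp.create_dataset("stimulus", data=stimulus, compression="gzip")
    # Meta-data travel inside the same file as attributes.
    f.attrs["genotype"] = "wild-type (placeholder)"
    f.attrs["start_time"] = "2024-01-01T12:00:00Z"
```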
2) Is human processing (reading/editing) or machine processing more critical to you?
Among text formats, some are very lightweight and thus more pleasant for humans to work with (CSV, JSON), whereas others (XML) have better support for machine processing and automatic validation. A main criterion in choosing between these categories is the complexity of your data structures (including meta-data). For simple time series or key-value stores, JSON is fine. For highly structured data, XML plus a schema is much safer in the long run. JSON and CSV have ambiguities right from the start (no two parsers accept exactly the same files), and validation of storage conventions is cumbersome enough that most people don’t do any.
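To sketch what automatic validation buys you (hypothetical file names, using the lxml package): a schema lets you check every meta-data file before accepting a session, rather than trusting whatever the acquisition code wrote.

```python
from lxml import etree  # third-party package

# Hypothetical file names; the point is that a schema lets you validate
# every meta-data file automatically before accepting a session.
schema = etree.XMLSchema(etree.parse("session-metadata.xsd"))
doc = etree.parse("session_0001.meta.xml")

if schema.validate(doc):
    print("meta-data file is valid")
else:
    for error in schema.error_log:
        print(error.message)
```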
From your description it seems that text is OK for you, but I cannot judge the complexity of your meta-data, so I won’t recommend either CSV/JSON or XML.
For a longer discussion, see https://doi.ieeecomputersociety.org/10.1109/MCSE.2012.108 (also without paywall on ResearchGate: https://www.researchgate.net/profile/Konrad_Hinsen/publication/256373624_Caring_for_Your_Data/links/)
This was extremely helpful! Thanks so much. In fact, I don’t know how “complex” our meta-data are going to be, either. The way I envision them, they’re not very complex: essentially a bunch of files grouped into experimental and control groups, with one document specifying the relationship of the files to each other and to the evaluations. Thus, probably just one, at most two hierarchical layers. Sounds simple to me…
Two more important points:
1) Before deciding on a data *format* you should first define a data *model*, and document it. A data model maps your data items to basic data structures such as lists, trees, or key-value pairs, whose leaves are basic data types such as integers, strings, etc. At that level, you can think about how large your integers can become, and how much precision you need for floating-point numbers.
Defining a data model first has three advantages. First, you make important decisions based on the needs of your data, rather than on the basis of what is easiest to do in some format such as JSON. Second, you can later map your data model to another format (e.g. if one day you need to switch to binary), and have simple loss-less translation between these formats. Third, it’s a valuable aid for others wishing to understand your data.
For an example of a data model definition with subsequent mappings to two file formats, see MOSAIC (https://mosaic-data-model.github.io/), a data model for molecular simulations.
2) If you design data models/formats for 20 years to come, consider evolving needs and make your data model and format(s) extensible. This criterion often leads to the choice of key-value pairs as data structures, because it is straightforward to add more keys as the need arises. MOSAIC provides an example of this as well.
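To illustrate both points with your kind of data: a data model can be written down independently of any file format, for instance roughly like this (all names and fields are hypothetical; the key-value slots are what keeps it extensible):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# A sketch of a *data model*, not a file format; all names are hypothetical.
# Key-value "settings"/"extra" slots keep the model extensible, and the same
# structures can later be serialised to JSON, XML or HDF5.

Scalar = Union[int, float, str]

@dataclass
class Period:
    name: str
    duration_s: int
    settings: Dict[str, Scalar] = field(default_factory=dict)   # extensible

@dataclass
class Session:
    model_version: str          # would point to a versioned model document/URI
    columns: List[str]          # names of the time-series columns
    periods: List[Period]
    extra: Dict[str, Scalar] = field(default_factory=dict)      # room to grow

session = Session(
    model_version="flysim-model-0.1 (placeholder)",
    columns=["timestamp_ms", "behavior", "stimulus"],
    periods=[Period("training", 240, {"temperature_celsius": 25})],
)
```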
Ah, this is very cool! A sort of centralized meta-data description of what kind of data can be expected? I studied the GitHub link and the section on the data model in your paper, but I’m still not sure how this would work in practice. From what I understand, I’d publish a versioned document which contains the model. In our case, the model would contain all the possible experiments we are currently doing with the machines and the data we collect. Future versions would be created in which new experimental settings would be described and new data types defined. Each meta-data file (or the meta-data for each data set?) would then refer to the URI of the currently applicable version of this document? Did I get this right? If so, what do you use for providing persistent URIs to the MOSAIC data model? Sorry if it’s in your paper and I didn’t find it!
Also, I see you have data values within XML tags. That would make things very unwieldy for us, as we have tens of thousands of data values in data matrices, so we’d have to refer to the data values from within the XML document.
Looks like you understood my MOSAIC paper very well! The data model is indeed versioned and each data file refers to a specific version. This would indeed ideally be done with a permanent URI, but I haven’t yet gotten around to setting up a viable scheme that is reasonably permanent. In fact, I don’t see any other way than reserving a domain and promising to maintain it indefinitely. With no organisation behind MOSAIC, that kind of permanence isn’t really worth much. If I get hit by a bus, the URI disappears a year later.
Note that MOSAIC has another extensibility feature: conventions. The nice thing about conventions is that anyone can set one up, completely independently of the versioned releases of the data model itself. Roughly, conventions are to MOSAIC as DTDs or schemas are to XML.
As for storing data in XML tags, that’s an unimportant implementation detail. It’s convenient for the kind of data I put there, and it makes the XML files more compact. Any tag could be replaced by a child element. The XML literature is full of debates about when to use tags vs. child elements, and I cannot see any general agreement there.
Actually, I think it might be worthwhile to talk with a library to see if they would set up a data model repository. One could go there, find a model that’s really close to the one the person needs and just fork it. Should be fairly easy and very useful. Plus, if there ever were any uptake, it would set a standard of sorts…
In addition to considerations on the data format itself, please prepare the completed data sets for easy retrieval and verification by others. This means publishing cryptographic checksums (SHA-256 or stronger) and trusted timestamped (RFC 3161) digital signatures along with the data. This way, if others produce copies of the work for backup or mirroring purposes, they can still verify data integrity and authenticity.
Those sound like very reasonable suggestions. The measuring devices will be offline during the experiment. Can you recommend automated systems we should implement to make sure we meet such requirements?
I am not currently aware of ready-made, easy-to-use tools that would automate the whole process, although I might give it some time and create something soon for another project. Meanwhile, the usual command-line tools are not that difficult to use, and the commands can be scripted with relative ease.
For example, Free TSA provides some instructions for how to create Time Stamp Requests:
https://freetsa.org/
This should be a best practice for all open data digital publishing imho, enabling trust for open mirroring. You don’t really need to sign the data at the point of measurement. What matters is that others can reasonably verify the source and data integrity back to the point of publication.
Hmmm, from what I see there, I’d need to do this for every data file separately; that’s not really feasible, as they are (and must remain, I think) simple text files. If both the hash and the timestamp are in a separate file, then the file could easily be replaced. Thus, I currently cannot see a way to do this in at least a semi-automated fashion, but I’ll definitely see if I can implement some approximation at least. Very cool, thanks!
From the instructions that I linked above: “For multiple files, the general concept is that timestamping a single file that contains an aggregate list of fingerprints of other files, also proves that these other files must have existed before the aggregate file was created, provided that both the aggregate file and the referenced file are available during verification process.”
You can use a single command like:
sha256sum *.xml > checksums.sha256.txt
This creates a single file that contains checksums of all the files in the current directory. Then you only need to timestamp that single file and it serves to authenticate all your data.
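If you want to script this on the offline machines without relying on external tools, a rough Python equivalent of that command might look like this (the file patterns are just examples):

```python
import hashlib
from pathlib import Path

# Writes the same kind of aggregate manifest as `sha256sum`, so it can run
# offline on the measurement machines; the file patterns are just examples.
def sha256_of(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

with open("checksums.sha256.txt", "w") as out:
    for path in sorted(Path(".").glob("*.csv")) + sorted(Path(".").glob("*.json")):
        out.write(f"{sha256_of(path)}  {path.name}\n")

# Only this single manifest then needs an RFC 3161 timestamp.
```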
I wonder if anyone has explored the possibility of using Git for providing a checksum and time stamp for data. You would keep all data in a Git repository, and do a commit when new data gets added, possibly automatically in a script.
Git actually does very sophisticated data management in its repositories, based on the principle of content-addressed storage. I am sure there are lots of good uses for this that are unrelated to version control.
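A rough sketch of how such an automatic commit could be scripted (paths and the commit message are placeholders, and a commit date by itself is of course not a trusted timestamp):

```python
import subprocess

# Sketch only: commit newly added data files so that git's content hashes and
# history record what existed when. Paths and the message are placeholders.
def commit_new_data(repo_dir, message="add new session data"):
    subprocess.run(["git", "add", "--all", "data/"], cwd=repo_dir, check=True)
    status = subprocess.run(["git", "status", "--porcelain"], cwd=repo_dir,
                            capture_output=True, text=True, check=True)
    if status.stdout.strip():  # only commit if something actually changed
        subprocess.run(["git", "commit", "-m", message], cwd=repo_dir, check=True)

# commit_new_data("/path/to/data-repository")
```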
Peter Colberg uses git to manage the *metadata* of his data files. He gave a seminar on this a few years ago; the slides are here https://wiki.scinet.utoronto.ca/wiki/images/5/55/Snug-git-annex.pdf and mention git-annex, as molecular simulation files are a bit large for typical git usage.
The idea is similar to what Konrad suggests, if I understand correctly.