What is the difference between text, data and code?In: science politics
tl;dr: So far, I can’t see any principal difference between our three kinds of intellectual output: software, data and texts.
I admit I’m somewhat surprised that there appears to be a need to write this post in 2014. After all, this is not really the dawn of the digital age any more. Be that as it may, it is now March 6, 2014, six days since PLoS’s ‘revolutionary’ data sharing policy was revealed and only few people seem to observe the irony of avid social media participants pretending it’s still 1982. For the uninitiated, just skim Twitter’s #PLoSfail, read Edmund Hart’s post or see Terry McGlynn’s post for some examples. I’ll try to refrain from reiterating any arguments made there already.
Colloquially speaking, one could describe the scientific method somewhat shoddily as making an observation, making sense of the observation and presenting the outcomes to interested audiences in some version of language. Since the development of the scientific method somewhere between the 16th century and now, this is roughly how science has progressed. Before the digital age,it was relatively difficult to let everybody who was interested participate in the observations. Today, this is much easier. It still varies tremendously between fields, obviously, but compared to before, it’s a world’s difference. Today, you could say that scientists collect data, evaluate the data and then present the result in a scientific paper.
Data collection can either be done by hand or more or less automatically. It may take years to sample wildlife in the rainforest and minutes to evaluate the data on a spreadsheet. It may take decades to develop a method and then seconds to collect the data. It may take a few hours to generate the data first by hand and then by automated processing, but decades before someone else comes up with the right way to analyze and evaluate the data. What all scientific processes today have in common is that at some point, the data becomes digitized, either by commercial software or by code written precisely or that purpose. Perhaps not in all, but surely in the vast majority of quantitative sciences, the data is then evaluated using either commercial or hand-coded software, be it for statistical testing, visualization, modeling or parameter/feature extraction, etc. Only after all this is done and understood does someone sit down and attempts to put the outcome of this process into words that scientists not involved in the work can make sense of.
Until about a quarter of a century ago, essentially all that was left of the scientific process above were some instruments used to make the observations and the text accounts of them. Ok, maybe some paper records and later photographs. With a delay of about 25 years, the scientific community is now slowly awakening to the realization that the digitization of science would actually allow us to preserve the scientific process much more comprehensively. Besides being a boon for historians, reviewers, committees investigating scientific misconduct or the critical public, preserving this process promises the added benefit of being able to reward not only those few whose marketing skills surpass the average enough to manage publishing their texts in glamorous magazines, but also those who excel at scientific coding or data collection. For the first time in human history, we may even have a shot at starting to think about developing software agents that can trawl data for testable hypotheses no human could ever come up with – proofs of concepts already exist. There is even the potential to alert colleagues to problems with their data, use the code for purposes the original author did not dream of or extract parameters from data the experimentalist had not the skill to do. In short, the advantages are too many to list and reach far beyond science itself.
Much like the after the previous hypothetical requirement of proofs for mathematical theorems, or the equally hypothetical requirement of statistics and sound methods, there is again resistance from the more conservative sections of the scientific community for largely the same 10 reasons, subsumed by: “it’s too much work and it’s against my own interests”.
I can sympathize with this general objection as making code and data available is more work and does facilitate scooping. However, the same can be said of publishing traditional texts: it is a lot of work that takes time away from experiments and opens all the flood gates of others making a career on the back of your work. Thus, any consequential proponent of “it’s too much work and it’s against my own interests” ought to resist text publications with just as much fervor as they resist publishing data and code. The same arguments apply.
Such principles aside, in practice, of course our general infrastructure makes it much too difficult to publish either text, data or software, which is so many of us now spend so much time and effort on publishing reform and why our lab in particular is developing ways to improve this infrastructure. But that, as well, does also not differ between software, data and science: our digital infrastructure is dysfunctional, period.
Neither does making your data and software available make you particularly more liable to scooping or exploitation than the publication of your texts does. The risks vary dramatically from field to field and from person to person and are impossible to quantify. Obviously, just as with text publications, data and code must be cited appropriately.
There may be instances where the person writing the code or collecting the data already knows what they want to do with the code/data next, but this will of course take time and someone less gifted with ideas may be on the hunt for an easy text publication. In such (rare?) cases, I think it would be a practical solution to implement a limited provision on the published data/code stating the precise nature of the planned research and the time-frame within which it must be concluded. Because of its digital nature, any violation of said provisions would be easily detectable. The careers of our junior colleagues need to be protected and any publishing policy on text, data or software ought to strive towards maximizing such protections without hazarding the entire scientific enterprise. Also here no difference between text,data and software.
Finally, one might make a reasonable case that the rewards are stacked disproportionately in favor of text publications, in particular with regard to publications in certain journals. However, it almost goes without saying that it is also unrealistic to expect tenure committees and grant evaluators to assess software and data contributions before anybody even is contributing and sharing data or code. Obviously, in order to be able to reward coders and experimentalists just as we reward the Glam hunters, we first need something to reward them for. That being said, in today’s antiquated system it is certainly a wise strategy to prioritize Glam publications over code and data publications – but without preventing change for the better in the process. This is obviously a chicken-and-egg situation which is not solved by the involved parties waiting for each other. Change needs to happen on both sides if any change is to happen.
To sum it up: our intellectual output today manifests itself in code, data and text. All three are complementary and contribute equally to science. All three expose our innermost thoughts and ideas to the public, all three make us vulnerable to exploitation. All three require diligence, time, work and frustration tolerance. All three constitute the fruits of our labor, often our most cherished outcome of passion and dedication. It is almost an insult to the coders and experimentalists out there that these fruits should remain locked up and decay any longer. At the very least, any opponent to code and data sharing ought to consequentially also oppose text publications for exactly the same reasons. We are already 25 years late to making our CVs contain code, data and text sections under the “original research” heading. I see no reason why we should be rewarding Glam-hunting marketers any longer.
UPDATE: I was just alerted to an important and relevant distinction between text, data and code: file extension. Herewith duly noted.