This is part one of a series of blog posts about best practices in data management for academic research settings. Many of the issues and principles I talk about here are less applicable to large-scale tech/industry data analysis pipelines, as data in those contexts tend to be managed by dedicated database engineers and managers, and the data storage setup tends to be fairly different from an academic one. I also don’t touch much on sensitive data issues, where data security is paramount. It goes without saying that if good data management practices conflict with our ethical obligations to protect sensitive data, ethics should win every time.
Good Data Management is Good Research Practice
Data are everywhere in research. If you are performing any sort of study, in any field of science, you will be generating data files. These files will then be used in your analyses and in your grant applications. Given how important the actual data files are, it is a shame that we as scientists don’t get more training in how to manage our data. While we often have excellent data analysis training, we usually have no training at all in how to organize data, how to build processing streams for our data, and how to curate and document our data so that future researchers can use it.
However, researchers are not rewarded for being excellent at managing data. Reviewers are never going to comment on how beautifully a dataset is documented, and you will never get back a summary statement from the NIH praising your choice of comma-delimited over tab-delimited files. Quite frankly, if you do manage your data well, your reward will be a lack of comments. You will know you did well if you can send a new collaborator a single link to your dataset and they respond with, “Everything looks like it is here, and the documentation answered all of my questions.” So, given that lack of extrinsic motivation, why should you take the time and effort to learn and practice good data management? Let me illustrate with a few examples of why good data management matters. All of the examples below are issues that arose in real-life projects (albeit with details changed both for anonymity and to improve the exposition), and every one of them could have been prevented with better data management.
Reverse coded, or was it?
I once had the pleasure of working on a large-scale study that involved subjects visiting the lab to take a two-hour battery of measures. This was a massive collection effort which, as is so often the case with statisticians, I fortunately got to avoid, coming in only after the data had already been collected. This lab operated primarily in SPSS, which, for those not familiar with it, is a very common statistical analysis package used throughout the social sciences. For many, many years, SPSS was the primary software everybody in psychology was trained on, and to its credit, it is quite flexible, has many features, and is easy to use. The reason it is so easy to use, however, is that it is a GUI-based system, where users specify statistical analyses through a series of dialog boxes. Layered under this is a robust syntax system that users can access, but this syntax is not a fully featured scripting language like R and is, to put it mildly, difficult to understand.
In this particular instance, I was handed the full dataset and went about my merry way doing some scale validation. But then I ran into an issue: a set of items on one particular measure was not correlating in the expected direction with the rest of the items. These items were part of a subscale that was typically reverse coded. The problem was, I couldn’t determine whether the items had already been reverse coded! There were no notes, the person who prepared the data couldn’t remember what they had done, and nobody could find any syntax. Originally, I was under the impression that the dataset I was handed was completely raw, but as it turns out it had passed through three different people, all of whom had recomputed summary statistics and scale scores and made other changes to the dataset. Because we couldn’t determine whether the items were reverse scored, we couldn’t use the subscale, and this particular subscale was one of the most important ones in the study (I believe it was listed as one we were going to analyze in the PI’s grant, which meant we had to report those results to the funding agency).
After a solid month of trying everything I could to determine whether the items were reverse scored, I ran across a folder from a graduate student who had since left the lab. In that folder, I found an SPSS syntax file, which turned out to be the syntax file used to process this specific version of the dataset. The only reason I could determine that, however, was that at the end of the syntax file the data were written out to a file named identically to the one I had.
Fortunately, this story had a happy ending in terms of the data analysis, but the journey through the abyss of data management was frustrating. I spent a month (albeit on and off) trying to determine whether items were reverse coded or not! That was a great deal of time wasted. Now, many of you might be thinking: why didn’t I go back to the raw data? Well, the truly raw data had disappeared, and the dataset I was working with was what the lab considered raw, so verifying against the original was impossible.
What I haven’t mentioned yet is that this was my very first real data analysis project, and I was very green to the whole issue of data management. This was a formative experience for me, and it led me to switch entirely from SPSS to R, in part to avoid this scenario in the future! The situation illustrated several violations of good data management practices (these will be explained in depth in a future post):
The data violated the Chain of Processing rule, in that nobody could determine how the dataset I was working with was derived from the original raw data.
It violated the Document Everything rule, in that there was no documentation at all, at least not for the dataset itself. The measures were well documented, but here I am referring to how the actual file itself was documented.
The data management practices for that study as a whole violated the Immutable Raw / Deletable Derivatives rule, in that the raw data was changed, and if we had deleted the data file (which was a derivative file) I was working with, we would have lost everything.
It partially violated the Open Lab Access rule, in that the processing scripts were accessible to me, but they were in the student’s personal working directory rather than saved alongside the datafile itself.
This particular case is an excellent example of data rot. Data rot is what happens when many different people work with a collection of data without a strong set of data management policies in place. Over time, more and more derivative files, scripts, subsets of data files, and even copies of the raw data are created as researchers work on various projects. This is natural, and with good data management policies in place, not a problem overall. But here, data rot led to a very time-consuming situation.
Data rot is the primary enemy that good data management can help combat, and it is usually the phenomenon that causes the largest, most intractable problems (e.g., nobody can find the original raw data). It is not the only problem that good data management practices can defend against, as we will see in the next vignette.
Inconsistent filenames make Teague a dull boy.
I am often asked to help out with data kludging issues, which is to say I help collaborators and colleagues get their data into a form they can work with. In one particular instance, I was helping a colleague compute a measure of reliability between two undergraduates’ data cleaning efforts. I was given two folders filled with CSV files, one file per subject per data cleaner, and I went about writing a script to match the file names between the two folders and then compute the necessary reliability statistics. When I went to run my script, it threw a series of errors. As it turns out, sometimes one data cleaner had cleaned a subject’s datafile while the other data cleaner missed that subject, which is expected. So I adjusted my script and ran it again. This time it ran perfectly, and I sent the results over to my colleague. They responded within 20 minutes to say that there were far fewer subjects with reliability statistics than they expected and asked me to double check my work. I went line by line through my script and responded that, given that the filenames were consistent between both data cleaners, my script picked out all subjects with files present in both cleaners’ folders.
Now, some of you readers might be seeing my mistake. I assumed that the filenames were consistent, and like most assumptions I make, it made a fool out of me. Looking at the file names, I found cases that looked like:
s001_upps_rewarded.csv vs. s001_ upps_rewarded_2.csv
Note the space after the first underscore and the _2 in the second file name. These sorts of issues were everywhere in the dataset. To my tired human eyes, they were minor enough that on a cursory examination of the two folders, I missed them (though I did notice several egregious examples and corrected them). But to my computer’s unfailingly rigid eyes, these were not the same file names, and therefore my scripts didn’t work.
The reason this happened is that, when this particular data was collected, the RAs running the session had to manually type in the filename before saving it. Humans are fallible but can adjust for small errors, while computers will do exactly what you tell them to. In my case, there was nothing wrong with the script I wrote; it did exactly what I wanted it to do, which was to ignore any unpaired files. The issue was that there was no guarantee about the structure of the filenames. So what data management principles did this case violate?
Absolute Consistency: The files were inconsistently named, which caused issues with how my script detected them.
Automate or Validate: The files were manually named, which means there was no guarantee that they would be named consistently. Additionally, there was no validation tool to detect violations of the naming convention (there is now, because I had to write one; a rough sketch of that kind of check follows this list).
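To make this concrete, here is a minimal sketch, in R, of the kind of naming check I mean. The folder names, the sNNN_measure_condition.csv convention, and the regular expression are all illustrative assumptions, not the actual study’s convention or the tool I actually wrote.

# Illustrative convention: sNNN_measure_condition.csv, e.g. s001_upps_rewarded.csv
validate_filenames <- function(folder, pattern = "^s\\d{3}_[a-z]+_[a-z]+\\.csv$") {
  files <- list.files(folder)
  bad <- files[!grepl(pattern, files)]   # anything that breaks the convention
  if (length(bad) > 0) {
    warning("Files violating the naming convention in ", folder, ":\n  ",
            paste(bad, collapse = "\n  "))
  }
  invisible(bad)
}

# Run on both (hypothetical) cleaners' folders before any matching script,
# so a name like "s001_ upps_rewarded_2.csv" triggers a loud warning
# instead of being silently dropped as an unpaired file.
validate_filenames("cleaner_a")
validate_filenames("cleaner_b")

A check like this turns a silent pairing failure into an immediate, visible complaint, which is exactly what manual naming needs.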
Now, this was not a serious case. I didn’t spend a month fixing the issue, nor was it particularly difficult to fix. But I did have to spend several hours of my time manually fixing the filenames, and any amount of time spent fixing data management issues is time wasted, because these issues can be preempted, or at the very least minimized, by following good data management principles.
Lift your voices in praise of the sacred text fmri_processing_2007.sh
In addition to behavioral data, much of my work deals with neuroimaging data, and in fact many of the ideas in this post came out of those experiences. Neuroimaging, be it EEG, MRI, fNIRS, or whatever other modality they invent between the time this is posted and the time you read it, produces massive numbers of files. For example, one common way of storing imaging data is the DICOM format. DICOMs are not single files, but rather collections of files representing “slices” of a 3D image, along with a header that contains a variety of metadata. There might be hundreds of files in a given DICOM, and multiple DICOMs can sit in the same folder. This is not necessarily an issue, as most software can determine which file goes with which image. But now imagine those files, their conversions into a better data format, and the associated behavioral data (usually multiple files per scan, with multiple scans per person), and you get a sense of my main issue with neuroimaging data: it can be stored in an essentially infinite number of ways.
When I first started working with neuroimaging data, I was asked to preprocess a collection of raw functional MRI scans. Preprocessing is an important step in neuroimaging because it a) corrects for a variety of artifacts and b) fixes the small issue of people having differently shaped brains (by transforming their brain images into what is known as a standard space). Preprocessing fMRI images involves quite literally thousands of decision points, and I wanted to see how the lab I received the data from did it. They sent over a shell script titled fmri_processing_2007.sh. The 2007 in the file name was the year it was originally written. This occurred in 2020. The lab I was collaborating with was using a 13-year-old shell script to process their neuroimaging data.
As aghast as I was, I couldn’t change that fact, so I took the time to try to understand which processing steps were being done, and I set the script running on my local copy of the dataset. It failed almost immediately. I realized I had made the mistake of fixing what I considered issues in the file names and organization, though I had attempted to do so in a way that wouldn’t break the script. After fixing the processing script, I managed to run it, and it completed processing successfully.
At the same time, I was working with a different neuroimaging group, and they requested processing code to run on their end. I sent over my modified script, as it was the only processing script I had on hand, and I felt I had made it generalizable enough that it should handle most folder structures. I was severely mistaken. My folder structure looked something like this:
/project/
    fMRI/
        s001_gng_1.nii.gz
        …
    MPRAGE/
        s001_mprage.nii.gz
        …
While the other lab’s folder structure looked like this:
/project/
    s001/
        fMRI/
            gng_1.nii.gz
        MPRAGE/
            mprage.nii.gz
I had written my script to assume that the first component of a file name was the subject ID, which it was in my data. In the other lab’s data, however, the subject IDs were specified at the level of the folder. Obviously, my script would not work without substantial alteration. I don’t think they ever made those alterations.
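To make the mismatch concrete, here is a rough sketch in R of the two assumptions; the helper names and the sNNN pattern are mine for illustration, not anything from either lab’s actual scripts.

# My layout: the subject ID is the first component of the file name.
subject_from_filename <- function(path) {
  sub("_.*$", "", basename(path))    # "s001_gng_1.nii.gz" -> "s001"
}

# The other lab's layout: the subject ID is a directory name.
subject_from_folder <- function(path) {
  parts <- strsplit(path, "/")[[1]]
  parts[grepl("^s\\d{3}$", parts)]   # pick out the sNNN folder component
}

subject_from_filename("/project/fMRI/s001_gng_1.nii.gz")   # "s001"
subject_from_folder("/project/s001/fMRI/gng_1.nii.gz")     # "s001"

Point a script hard-coded to the first assumption at the other lab’s layout and it happily returns “gng” as the subject ID, which is more or less what happened.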
There are two good data management principles violated here:
Redundant Metadata: In the case of the other lab, the file names did not contain the subject information. What would have happened if those files were ever removed from the subject’s folder?
Standardization: This is more of a goal than a principle. Imagine if both I and the other lab had used a standardized way of storing our data and had written our scripts to fit it. We would have been able to pass code back and forth without issue, and that would have saved us both time and trouble.
Neither data rot nor human fallibility was to blame for these issues. In fact, both datasets were extremely consistently organized, and there were no mistakes in naming. We simply didn’t use the same data structure, and it is worth asking why. In this case, it was simple inertia. Both I and the analysts at the other lab had scripts written for a given data structure. In my case, the scripts I had were handed down from PI to PI for years, until the original reasons certain data design decisions were made faded from memory. I like to call this the sacred text effect. It usually occurs with code or scripts, but it can occur with any practice. The conversation usually goes like this:
You: Why is this data organized this way?
Them: Because that is how my PI organized data when I was in graduate school, and besides all of our analysis scripts are designed for this data structure.
You: Would you consider changing over to a more standardized data structure? There are several issues with the current structure that would be easily fixable, and if we use this standard, we can share the data more freely as well as use tools designed for this data structure.
Them: Sure, I guess, but could you fix our current scripts to deal with the new structure?
Suddenly you’ve signed up for more work! It is vital that labs do not get locked into a suboptimal data management practice simply due to inertia. If a practice doesn’t work, or causes delays, take the time to fix it. It might take time now, but you will make that time back tenfold. A great example of this, and a major inspiration for this post, is the BIDS Standard, a scheme for structuring and storing neuroimaging data.
These three cases illustrate the consequences of bad data management, but there are many more examples I could write about. To adapt a common idiom about relationships:
But again, it is important to point out that this is not due to any incompetence on the part of researchers. I’ve spoken and worked with many researchers who do not have the same technical background I do, and every one of them recognizes the issues inherent in bad data management practices (and can usually come up with a startling number of examples from their own work). It is simply that they were never trained in good data management, so they have had to figure everything out on their own, and these are very busy people. In the next post, I’ll lay out what I see as eight principles of good data management for researchers. These principles grew out of my experience in the social and biomedical sciences, so they might not wholly apply to, for example, database management in a corporate setting.