Eight Principles of Good Data Management

This post is the second in a series of posts about data management practices (see the introduction here). Before I get into my principles of good data management, I want to say that I found out after my previous post that librarians, and library science as a field, have been thinking and writing about data management for years now. Many university libraries have programs/seminars where the librarians will consult with you about data management. This is a wonderful resource, and, if my own experience is any indication, a very underutilized one. So, if you are at a university, check out your library system!


In today’s post, I walk through eight principles of good data management. These are wholly informed by my own experiences with data management and analysis, and I wrote them with the following in mind: when it comes to data management, you are your own worst enemy. I’ve lost count of the times I’ve started complaining about some aspect of a dataset, checked who made the relevant change, and found out it was me a week ago. So these are principles that I try to follow (and oftentimes don’t quite) to protect my work against mistakes, to save time, and to make it easier to collaborate with others. This is, of course, not an exhaustive list, nor likely the absolute best way of managing data, but I’ve found these principles to be helpful personally.

Document Everything

This principle should be fairly self-explanatory: there should be well-formatted documentation for every bit of data in a dataset. Seems simple enough, but in practice documentation tends to fall by the wayside. After all, you know what data you collected, right? But good documentation makes everyone’s work so much simpler. So what makes for good documentation? For me, good documentation follows these guidelines:

  • It references files/variable names. A codebook is documentation, but we can think of any sort of data documentation as codebook-esque. It is not useful to simply say: “The UPPS was collected.” I need to know that the UPPS is in the behav_baseline.csv, and is labeled along the lines of “upps_i1_ss” (UPPS, item 1, sensation seeking subscale).

  • It’s complete. There shouldn’t be any bit of data that is not described by some sort of document. If your raw data includes, for example, a screenshot of a behavioral test result (a real thing: the test in question was proprietary, and the only way of storing a record of the participant’s score was to take a screenshot of the reporting page!), then those files need to be described, even though all the test results have presumably been transcribed into a machine-readable file.

This also holds for item text (and response option text), descriptions of biological assays, and even neuroimaging acquisition details. The codebook should contain all the information about how the data was collected in an easily human-readable format. The time an analyst spends hunting through Qualtrics surveys or reading through scanner protocol notes is time they don’t spend actually analyzing your data.

  • It’s cross-referenced: For the love of science, please put in the correct references when you describe a measure in your codebook! This makes it much easier to prepare an analysis for publication. Additionally, if certain measures are repeated, or have different varieties (child, parent, teacher version), make sure that these are all linked together. The goal here is to make it easy for an analyst to understand the structure of the data with respect to the analysis they are trying to perform, and to make it easier for them to prepare that analysis for eventual publication.

  • It’s thorough: This is not the same as being complete. Thoroughness is more about how aspects of the data are described. Technically the following is a complete description:

    • UPPS Child Scale (Zapolski et al., 2011)

But it doesn’t tell you anything about that measure. A more thorough description would be: 

  • UPPS-P Child Scale (Zapolski et al., 2011): Self-report impulsivity scale with 4 subscales: negative urgency (8 items), perseverance (8 items), premeditation (8 items), sensation seeking (8 items). Items are rated on a 1-4 Likert scale. The child version is a modified version of the adult UPPS (Whiteside & Lynam, 2001).

This description tells me what the scale is, what it contains, and what to expect from the items themselves. It’s also cross-referenced, with historical information. It doesn’t go into the meaning of each subscale (that wouldn’t be within the scope of a codebook), but it provides meaningful information for any analyst.

  • It’s well indexed: Give us a table of contents at the very least. I don’t want to have to flip through the codebook to find exactly what I need. The ability to look for, say, baseline child self-report measures and see that they start on page 20 just makes my job much easier.

  • It describes any deviations from the expected: Say you modified a scale from a 1-5 Likert scale to a 1-7. That needs to be noted in the documentation, otherwise it can cause big issues down the line. On the other hand, if you used a scale as published, you just need to provide the minimal (but thorough!) description.

When writing a codebook, remember: you are not writing it for yourself. You are writing it for somebody who has never seen this data before (which also describes you, two weeks after you last looked at the data). What do they need to know?
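Nothing about this principle requires the codebook to be machine readable, but it can help to keep a structured version alongside the prose document. Below is a minimal sketch of what a single codebook entry might look like if stored as JSON; the field names, file name, and variable pattern are my own hypothetical illustrations, not a standard.

```python
import json

# Hypothetical structured codebook entry; field names are illustrative only.
entry = {
    "measure": "UPPS-P Child Scale",
    "reference": "Zapolski et al., 2011; adapted from Whiteside & Lynam, 2001",
    "file": "behav_baseline.csv",
    "variables": "upps_i{item}_{subscale}",   # e.g. upps_i1_ss
    "subscales": {"nu": "negative urgency", "ps": "perseverance",
                  "pm": "premeditation", "ss": "sensation seeking"},
    "response_scale": "1-4 Likert",
    "deviations": "none",
}

with open("codebook_upps.json", "w") as f:
    json.dump(entry, f, indent=2)
```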


Chain of Processing

Very few data management issues are worse than not knowing what somebody did to a piece of data. It effectively makes the data unusable. If I don’t know how an fMRI image was processed, or how a scale was standardized, I cannot use it in an analysis. On the other hand, if I have documentation of what happened, who did it, and why, I can likely recover the raw form or, at the very least, evaluate what was done.

This principle is obviously inspired by the idea of a “chain of custody” in criminal investigations. My (admittedly lay-person) understanding is that for evidence to be considered in a trial, there needs to be a clear record of what was done to it and by whom, from the moment the evidence was collected to the moment the trial concludes. This protects everybody involved, from the police (from accusations of mishandling) to the accused (from actual mishandling). Similarly, this idea applied to data management protects both the analyst and the analysis at hand.

Describing the chain of processing can be done in multiple ways. I am in favor of a combined scripting/chain-of-processing approach, where I write processing scripts that take raw data, process it, and return either data ready to be analyzed or the results of an analysis. In this case, the script itself shows the chain of processing, and anybody who looks at it will be able to understand what was done in a given case (if I’ve written my code to be readable, which is always a dicey proposition). Another approach is the use of changelogs. These are text files (or an equivalent machine- and human-readable format, like JSON) that analysts use to note when they make changes to any aspect of the data. Sometimes changes need to be made by hand, such as when the data requires manual cleaning (e.g., psychophysiology data), and the changelog would need to be manually updated. Other times these changelogs can be created by the scripts used to process the data.
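As a concrete illustration, here is a minimal sketch of a changelog helper, assuming a simple JSON format; the file layout and field names are my own invention, not a standard.

```python
import getpass
import json
from datetime import datetime, timezone
from pathlib import Path

def log_change(changelog: str, files: list[str], action: str, reason: str) -> None:
    """Append one entry (what, who, when, why) to a JSON changelog file."""
    path = Path(changelog)
    entries = json.loads(path.read_text()) if path.exists() else []
    entries.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": getpass.getuser(),
        "files": files,
        "action": action,
        "reason": reason,
    })
    path.write_text(json.dumps(entries, indent=2))

# Called at the end of a processing script, or run by hand after manual cleaning:
log_change("changelog.json", ["sub-012_physio_clean.csv"],
           action="manual artifact rejection",
           reason="movement artifact at 00:43-00:51")
```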

This principle is so important that I will say it plainly: I would prefer a badly written changelog or a hard-to-read script to no chain of processing at all.


Immutable Raw / Deletable Derivatives

Imagine the case where a scale has been standardized. This is a fairly common procedure: subtracting the mean and dividing by the standard deviation. It makes variables that are on different scales comparable. Now imagine that you only have the standardized scale, and no longer have the raw data. This is a huge issue. Because you do not know what values were used to standardize the scale, you wouldn’t be able to add any more observations.
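To make that concrete, here is a tiny sketch (with hypothetical numbers) of why the raw values matter:

```python
import numpy as np

raw = np.array([12.0, 15.0, 9.0, 22.0])            # hypothetical raw scale scores
z = (raw - raw.mean()) / raw.std(ddof=1)            # the standardized derivative

# Placing a new observation on the same scale requires the ORIGINAL mean and SD.
# If only `z` was kept, those values are gone, and the new score cannot be added.
new_score = 18.0
new_z = (new_score - raw.mean()) / raw.std(ddof=1)
```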

Okay, so that might be a trivial example. Let me mention one that I’ve encountered many times: the processed neuroimaging files are available, but the raw images are not. Usually this is not because the raw data was deleted, though that has happened in my experience.

If you don’t have the truly raw data, you cannot recover from any data management mistakes. This means that your raw data is precious. You might never analyze the truly raw data, I know I don’t, but it needs to be kept safe from deletion or modification. Ideally, your dataset can be divided into two broad sections. The first is the rawest data possible: images right off the scanner, a .csv downloaded straight from Qualtrics. Label it well, document it, set it to read-only, and never touch it again. The second half is your data derivatives. These are any bits of data that have undergone manipulation. If you have pulled scales out of that raw dataset, that is a derivative. If you have cleaned your raw physio data, derivative. Because you are presumably following the previous principle, Chain of Processing, you know precisely how each derivative bit of data was created. As such, your derivatives are safely deletable. Now, it might not be wise to delete some derivatives. For example, if your physiological data was cleaned by hand (as much of it is), then even if you know exactly how it was cleaned, given the time and effort involved you likely shouldn’t delete those derivative files. But if push came to shove, and those files were deleted, you would be able to recover any work you had done.

I delete derivatives all the time, because my workflow involves writing data processing scripts. In fact, I often don’t even produce actual derivative files, instead keeping any processing in memory, so that when I go to run an analysis, I reprocess the data each time. Whatever way you do it, make sure your raw data is immutable (read-only and backed up in multiple locations) and that you have a chain of processing to tell you how each derivative file was created. If both of those are in place, you can rest much easier when thinking about how your data is stored.
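One minimal way to enforce the read-only half of this, sketched in Python under an assumed directory layout (the path is hypothetical):

```python
import stat
from pathlib import Path

RAW = Path("data/raw")                 # hypothetical location of untouched raw data

def lock_raw(raw_dir: Path = RAW) -> None:
    """Strip write permission from every raw file so it can't be changed by accident."""
    for f in raw_dir.rglob("*"):
        if f.is_file():
            mode = f.stat().st_mode
            f.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

lock_raw()
```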


Automate or Validate

I’ve mentioned scripts several times so far, so it should be no surprise that scripting is one of my principles of good data management. This principle says: if you can automate a part of your processing, do so. However, oftentimes you can’t automate fully. In those cases, write scripts to validate the results of the non-automated processing. By validation I don’t mean checks to ensure the processing was done correctly; I mean checks to make sure that your files are stored correctly and that you don’t have any file format differences.

Why write scripts/validators? Because you are a weak, tired human being (presumably?). You make mistakes, you make typos, and you forget. Sure, you are clever, and can think of unique solutions to various data issues, but a solution is only useful when applied consistently. A computer, on the other hand, does exactly what it is told to do, nothing more and nothing less. Take advantage of that quality! A script that performs your data processing will process the same way each time, and acts as a record of what was done. But what about mistakes? Yes, you will make mistakes in your code, mistakes you don’t catch until later. I develop a processing pipeline for neuroimaging data (link!). In an early development build, I didn’t add a “-” in front of a single variable in a single equation. This inverted the frequency-domain filter I was implementing, so instead of removing frequencies outside of 0.001-0.1 Hz, it removed frequencies within 0.001-0.1 Hz. Fortunately, this was simple to detect when I was testing the function, and a couple of hours of tearing my hair out looking at my code for errors led me to find the issue and correct it.

Contrast this with an experience a colleague had with their data. They were doing a fairly simple linear regression, and needed to merge data from two spreadsheets. Each spreadsheet looked identically ordered with respect to subject ID, so they copied and pasted the columns from one into the target spreadsheet. I’ve done this, we all have, and I don’t fault them for it. We really shouldn’t be doing by-hand merges, though. As my colleague realized the night before they were going to submit the manuscript, the first dataset was only ordered the same way for the first 100 or so observations. Then there was a subject ID that was not in the second dataset. So, after the copy and paste, the first 100 observations were accurately matched, but every observation after that was offset by one row. Visually, this wasn’t apparent until you scrolled to the bottom of a very long dataset, as there were no missing rows (which would have flagged the mismatch quite quickly). Statistically, this is effectively equivalent to randomly reordering your variables column by column. A Very Bad Thing™. No wonder they had some very counterintuitive results! I am glad they found the issue before submitting that manuscript, because if it had been published with that mistake, it would have needed to be retracted!

So what happened? Well, my colleague did a very innocuous, commonly done bit of data processing, in a way they had done 100 times before. Just that this time it led, through no fault of their own beyond the momentary lapse of attention that afflicts most of us from time to time, to a retraction-worthy mistake. A retraction-worthy mistake that was nearly undetectable, and was only found because my colleague happened to scroll to the bottom of the spreadsheet while looking at some unrelated aspect of the data.

Would scripting the data merge avoid this? Categorically, yes. There are ways of messing up data merges when scripting, many ways, but in my experience those become apparent very quickly. This particular issue, a single additional observation in the first dataset, would have been completely avoided by scripting the merge. The scripted solution would also be more robust to other potential issues. For example, what if the ordering of the observations was completely different? If you script the merge, you don’t even need to worry about that; the software will take care of it.
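A minimal sketch of such a scripted merge, here using pandas (the file and column names are hypothetical):

```python
import pandas as pd

behav = pd.read_csv("behav_baseline.csv")      # assumed to contain a 'subject_id' column
imaging = pd.read_csv("imaging_summary.csv")   # also assumed to contain 'subject_id'

# The merge is keyed on subject ID, not row order, so an extra or missing ID
# cannot silently shift every following row.
merged = behav.merge(imaging, on="subject_id", how="inner", validate="one_to_one")

# IDs present in one file but not the other can be surfaced explicitly:
only_in_behav = set(behav["subject_id"]) - set(imaging["subject_id"])
print(f"{len(only_in_behav)} subjects have behavioral data but no imaging summary")
```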

Validators are quite useful too, though I will admit I haven’t used/encountered many of them. The one I do use is the Brain Imaging Data Structure (BIDS) validator. BIDS is a standardized way of storing neuroimaging data (and an inspiration for this series of blog posts!), and the validator simply looks at the data structure to see if everything is where it needs to be. It flags files that have typos and identifies where metadata needs to be added. Another validator I’ve written checks that file names are structured the same way across datasets of psychophysiological data, which require cleaning by hand. That hand cleaning leads to typos in file names, because RAs need to click Save As and type out the name of the file. So, I run this validator before I do any additional data kludging, just so I know my scripts are going to get all the data I was sent. Which is a great segue into my next principle: Guarantees.
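Before moving on, here is a minimal sketch of what a filename validator like that might look like; the naming pattern and directory are hypothetical stand-ins.

```python
import re
from pathlib import Path

# Hypothetical convention for hand-cleaned physio files: sub-001_rest_physio_clean.csv
PATTERN = re.compile(r"^sub-\d{3}_[a-z]+_physio_clean\.csv$")

def validate_filenames(data_dir: str) -> list[str]:
    """Return every .csv whose name does not match the expected convention."""
    return [f.name for f in Path(data_dir).glob("*.csv") if not PATTERN.match(f.name)]

bad = validate_filenames("physio_cleaned")
if bad:
    print("Fix these before any further processing:", bad)
```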


Guarantees

What are guarantees in the context of data management? Guarantees are simple: if I am looking at a dataset and there is a certain file naming structure, file format, etc., then I should be guaranteed that all relevant files follow that rule. Not most, not almost all, not all almost surely, but absolutely all relevant files. One way of guaranteeing your Guarantees is to use scripts to process your data. Guarantees are all about consistency, and nothing is more consistent than a computer running a data processing script. Validators are a way of verifying that your guarantee holds.

But why bother? Why does it matter if the variable names are not consistent between scales? Or that mid-study, the behavioral task output filename convention changed? Well, if I were doing analyses with a calculator (like the stats nerd I am), I would be able to adjust for small deviations. But I’m not going to do that (still a nerd); I write analysis scripts. And again, computers only ever do precisely what they are told to do. Guarantees are a way of simplifying the writing of new analysis scripts, or even new processing scripts. Here is an example. Consider two variable names: “UPPS_i1_ss_p” and “Conners_p_1”. I do quite a bit of text processing to create metadata when running analyses, and in this case, I might want to pull out item-level means for every item in the UPPS and the Conners. But if I do a string split on “_” and look for the item number in the second slot, well, in the UPPS the second slot is “i1”, but in the Conners it is “p”. I would have to make a modified version of my UPPS processing code to fit the Conners.

But what if my variables were guaranteed to have the following formatting?

“scale_i#_subscale_source” (with an indicator if there is no subscale). 

Then I can write a single script that pulls the necessary information from each variable name, and apply it to every scale in my dataset. It makes programming analyses much simpler, and reduces the need to check the codebook for every new scale.   
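A minimal sketch of that single script’s core, assuming the guaranteed “scale_i#_subscale_source” scheme (the example names are hypothetical):

```python
def parse_variable(name: str) -> dict:
    """Split a guaranteed 'scale_i#_subscale_source' variable name into its parts."""
    scale, item, subscale, source = name.split("_")
    return {
        "scale": scale,
        "item": int(item.lstrip("i")),
        "subscale": subscale,   # e.g. "ss", or a placeholder like "none" if no subscales
        "source": source,       # e.g. "p" for parent report
    }

parse_variable("UPPS_i1_ss_p")
# -> {'scale': 'UPPS', 'item': 1, 'subscale': 'ss', 'source': 'p'}
```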

The main benefit of guarantees is that they reduce the cognitive load on the analyst. If I know that the file names have a standard structure, and that they were script generated, I can immediately relax and not be on the lookout for deviations in file naming that might mess up my processing stream. Because of this, I can better focus on performing the processing itself correctly. In my experience, when one has to adapt code to little idiosyncrasies in the data structure, those adaptations are where the mistakes creep in. I’ve never written cleaner code than when I work with guaranteed data.


Open (Lab) Access

Science is fundamentally collaborative. Even if you are the main statistical analyst on a project, you will still need to access data that was generated or collected by other lab members. This brings up an annoying issue, that of data permissioning. There are two ways I’ve seen this issue come up. 

The first is a literal file permission problem. I work on a variety of neuroimaging projects, and, for a variety of historical reasons, neuroimaging data is usually processed on Linux workstations. One aspect of Linux, and particularly Linux on shared workstations, is that each file has a set of access permissions. In Linux, these permissions answer the following: 1) can you, the file’s owner, read/write/execute the file? 2) can members of your user group read/write/execute the file? And 3) can anybody else read/write/execute the file? If you happen to be using a personal Linux machine (more power to you, I guess?), this is not an issue, as you can be fairly certain that the only person accessing files on your computer is you. But on a workstation this can become an issue, because if the permissions aren’t set correctly, other members of your lab won’t be able to access any files you have created. In neuroimaging this quickly becomes problematic, as each step in preprocessing creates temporary files. About 70% of the issues I have encountered using the various pipelines I have developed have ultimately come down to permission issues.
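A minimal sketch of one way to open up a project directory to your lab’s group after a pipeline run; the path is hypothetical, and the same thing can be done with chmod on the command line.

```python
import stat
from pathlib import Path

shared = Path("/data/study01/derivatives")   # hypothetical shared project directory

# Give the lab's group read access to files, and read + traverse access to directories,
# so collaborators aren't locked out of anything the pipeline just wrote.
for p in shared.rglob("*"):
    mode = p.stat().st_mode
    extra = stat.S_IRGRP | (stat.S_IXGRP if p.is_dir() else 0)
    p.chmod(mode | extra)
```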

Of course, fixing actual file permissions is a fairly simple thing to do. But there is a more problematic “permissions” issue that often occurs in larger labs. I like to refer to this as the balkanization of lab data. This is when, due to internal lab politics, different bits of data are only accessible to certain investigators. One example that I have personally experienced: I had access to the neuroimaging data from a large longitudinal study, but the self-report data was not accessible to me at all; it was held by an investigator at a university halfway across the country. To get access to this data, we had to request specific cases/variables, and this investigator would then send us just those records.

Now, before I start criticizing this practice, I will note that this sort of data access issue can happen for very good reasons (as was the case with the large longitudinal study). Oftentimes, if there is very sensitive information (think substance use self-report in children), that data has additional security requirements. A good example of this is the ADDHealth national survey. This is a very large national survey that collected health behavior data on high schoolers, and one unique aspect of it is that there is social network information for many of the participants. Additionally, ADDHealth collected romantic partner data, including whether two participants were romantic partners. This, combined with the fact that one could theoretically identify specific participants based on the extensive self-report data (deductive disclosure), means that this data needs to be specially protected. To access the romantic partner data, an investigator needs to dedicate an entire room that only they can access (not anybody in their lab, just the actual investigator), with a computer that has been chained to the room and has no internet access. There are a number of other requirements, but the one that made me laugh a bit is that if you store the data on a solid state drive (SSD), you are required to physically destroy the drive at the end of the analysis. So there are a number of cases where restricting access to sensitive data is quite reasonable.

That being said, I believe that a PI should make every effort to ensure equal access to all data for all analysts in their lab. This smooths the working process, and reduces mistakes due to miscommunication. When I am looking for data, I usually know exactly what I need, but I might not know exactly what form it takes. If I have access to all the data for a given study, I can hunt for what I need. If I have to ask another person to send me the data, I usually will have to go back and forth a couple times to get exactly what I need. 

So why does this balkanization happen? Usually, there is no reason. Somebody just ran a processing script on their own desktop and never put the file on a shared drive. Occasionally, balkanization can be subtly encouraged by a competitive lab culture. Grad students might keep “their” data close to the chest because they worry that somebody might scoop their results. I’ll be blunt: scooping within a lab should be impossible. If two lab members get into this sort of conflict, the PI is either ill-informed about what is happening in the lab (they didn’t nip it in the bud or come down hard on whoever was trying to scoop), or malevolent (they encouraged this behavior in the extremely misguided belief that competition, red in tooth and claw, makes for better scientists). It categorically does not. This balkanization can also occur between investigators, for example when two labs that collaborated on a larger study divide the data between themselves. Personally, I find this to be ridiculous, as again, any concerns about who gets to publish what paper should be dealt with by dialogue, not by restricting access. But, admittedly, when this sort of divide happens, it is rarely resolved in the fashion I prefer (data pooled and everybody given equal access), simply due to investigator inertia/ego.

To avoid issues with data access, data storage plans should be drawn up before the first subject is collected. These plans should indicate whether any aspects of the data are deemed sensitive and would require secure storage. Beyond that, these plans should work to provide as equal and as full access as possible to any lab member who would, reasonably, be performing analyses. Who gets to write/publish a given project should be negotiated openly and clearly. If this kind of transparency is encouraged, then questions about who has access to data quickly become irrelevant.


Redundant Metadata

Metadata refers to data about data. A good example is the set of scanner parameters for neuroimaging data. The data itself is the scan, while the metadata are the acquisition settings for that scan. In neuroimaging these are vital to know, as they tell you, among many other things, how fast the data was collected, what direction the scan was in, and what the actual dimensions of the image are. In a more traditional self-report survey, metadata could be the actual text of each item, along with the text of the response options.

For multi-file datasets, such as ones where there are separate data files for each subject, a piece of metadata would be which subject is associated with each file. Metadata is obviously important, but oftentimes it is only stored in a single place. Take this simple example, comparing two directory structures:

/data/sub-01/behavioral/baseline.csv

/data/sub-01/behavioral/sub-01_behavioral_baseline.csv

Both directory/file combinations contain the same information: the file is the baseline behavioral data for subject 01. But the second combination has redundant information. Not only does the directory structure tell you that this is the behavioral data for subject 01, the file name itself reiterates this. Why is this useful? Well, say you want to analyze all the baseline behavioral data. You extract all the baseline data into a new directory. In the first case:

/newdir/baseline.csv

/newdir/baseline(1).csv

/newdir/baseline(2).csv

In the second:

/newdir/sub-01_behavioral_baseline.csv

/newdir/sub-02_behavioral_baseline.csv

/newdir/sub-03_behavioral_baseline.csv

In the first case, you’ve lost all identifying information, while in the second case, the important metadata is carried along in the file name. I know which case I would prefer to work with! While this scenario is a bit of a straw man, it does happen. I’ve seen it in neuroimaging datasets, where the subject is indicated only at the directory level, à la /sub-01/anat/mprage.nii.gz. In fact, this is/was a fairly common data structure, as certain neuroimaging software packages effectively incentivized it.
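A minimal sketch of why the redundancy pays off: with the metadata in the file name, a flat copy of the files remains self-describing (the names here are hypothetical).

```python
from pathlib import Path

# Even after being copied out of its subject directory, the file identifies itself.
name = Path("/newdir/sub-01_behavioral_baseline.csv")
subject, datatype, timepoint = name.stem.split("_")
print(subject, datatype, timepoint)   # sub-01 behavioral baseline
```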

Metadata is tricky, because there is usually so much of it and you usually don’t know every piece you might need. So, store it all!


Standardization

So, you’ve decided to implement all the principles so far, and you’ve convinced your colleagues to implement good data management practices as well. Wonderful! Your analysis pipelines are humming along, your graduate students and postdocs have slightly less haunted looks in their eyes, and you feel that warm feeling that only well organized data can give you (no? Only me then?). 

With your newfound confidence in your data, you decide to strike up a collaboration with a colleague. That colleague has also jumped onto the data management train, so you are confident that when they send you data it will be well organized and easy to use.

So they send you their data! And it is beautifully organized! Beautifully organized in a completely different way than your data!

Well, now all of your data processing/analysis scripts will need to be rewritten. This might be easier than usual, because of how well your colleague’s data is organized, but it still takes time. So, how can we streamline this?

Now we come to the final principle, to which all other data management principles lead: Standardization. On the face of it, this principle is fairly simple. Labs working with similar study designs/datasets should use a single standard data management setup, rather than many different setups, no matter how well managed those individual setups might be. However, this is much easier said than done.

Different labs/projects/PIs have different data requirements and likely use different software tools. This leads to a proliferation of data management choices, which makes a single standardized data management schema that works for every possible case nearly impossible. The closest thing I have seen to complete standardization is the (previously mentioned) BIDS format, which is only possible because there are a limited number of data sources for neuroimaging and a great deal of effort has gone into standardizing the low-level file formats used in neuroimaging (e.g., NIfTI as a standard MRI storage format).

If universal data management standardization is impossible, what can be standardized? I think of datasets as puzzles made up of different modalities of data, where each modality is a type of data that shares most of its characteristics. For example, I consider questionnaire data to be a single modality, as is fMRI data. Conceivably, a standard format for questionnaire data could be developed (I would suggest pairing CSV files with a metadata JSON, but there are many other ways as well). I think standardization of individual modalities of data is the right way of approaching this problem.
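As a sketch of what that CSV-plus-JSON pairing could look like in practice (the file names and sidecar fields are hypothetical, not a proposed standard):

```python
import json

import pandas as pd

# The CSV holds the responses; the JSON sidecar holds the item-level metadata.
responses = pd.read_csv("behav_baseline.csv")
with open("behav_baseline.json") as f:
    meta = json.load(f)

# e.g. meta["upps_i1_ss"] might contain the item text and response option labels:
# {"item_text": "...", "levels": {"1": "Strongly disagree", "4": "Strongly agree"}}
print(meta["upps_i1_ss"]["item_text"])
```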

Even with the scope restricted to specific modalities of data, true standardization is difficult. So what can the individual researcher do? Well, first, and most importantly, researchers need to be talking about data management with colleagues and students. There is a tendency for PIs to abstract away from the nitty gritty of data management and data analysis, and while I understand the reasons for that (grants don’t write themselves!), this inattention is one of the leading drivers of data rot. By working data management into scientific discussions and project planning, I find it grounds the conversation and focuses it on the question of what we can do with what we have. From there, researchers should explicitly share their data management scheme with colleagues. If you’ve saved time by implementing good data management, then your colleagues would likely benefit from adopting what you’ve done. While this can be a bit of work, I’ve found that by emphasizing the time-saving aspects of good data management, otherwise very busy PIs become much more amenable to changing the structure of their data storage.

Ideally, as data management is discussed and setups are shared, this would naturally lead to a kind of standardization. Consider a fairly simple example: a well-structured variable naming scheme such as scale_i#_subscale_source. Changing variable names in a dataset tends to be very simple, and once researchers see how useful a standardized naming scheme can be, it can be quickly adopted. The key here is for researchers who are trying to bring good data management practices to the table to keep up the pressure. Researchers/scientists/academics, myself included, tend to have quite a bit of inertia in how we like to do things. We get into ruts, where the way we know how to do a task is so familiar and easy that we continue out of convenience. But the siren call of “it will save you time” is strong, and I’ve gotten the best results when pitching standardization by emphasizing the advantages rather than pointing out what is going wrong.


Summary 

The above data management principles were derived from my own experiences working with all kinds of data, and they are not meant to be exhaustive or overly rigid. My goals when thinking about data management are: how do I protect my work from my greatest enemy (me from a week ago), and how do I save time and cognitive energy? I don’t like wasting time, and I don’t like to repeat work. That being said, all of these principles are well and good when you are setting up a new study, but what if you are currently working with a dataset? Or you are a new graduate student or postdoc being handed a dataset? You might want to start reorganizing the data immediately into a better management structure. I would urge caution, though. Not only do PIs tend not to like their datasets being unilaterally reorganized by a new member of the lab (a scenario that I know nothing about, nothing at all), you also likely don’t know enough about the dataset in question to even begin to reorganize it. In order to efficiently and correctly reorganize the data, you need to understand what you have. You need to perform a data audit, which is a systematic investigation of an existing dataset for the purposes of identifying:

  1. What should be in it. 

  2. What is actually in it. 

  3. How the data is currently organized.

  4. How it could be organized better.

In my next data management post, I’ll be walking through how I perform a data audit, and what I think needs to be in one. Thanks for reading!