Data Management Practices

Everything in Its Place: A Guide to Data Audits

After a bit of a hiatus due to starting at University of Virginia, I've finally sat down and written the next post in my series on Data Management for Social Scientists. For those of you who missed the first two, you can check them out here and here! As always, this guide is based on my own experiences, and there are many ways to successfully and efficiently document a dataset. If I've missed anything, please feel free to let me know!


So, you have joined a new lab, started a new lab, or received a dataset from a collaborator, and you are looking forward to digging in. You quickly realize that because that new lab or that new dataset doesn't look anything like what you are used to, you need to take time to better understand the data structure. This sounds like a good time to perform a Data Audit. Data auditing is a practice often used in corporate settings to evaluate the location, quality, and consistency of an organization's databases, with a particular eye to how the data are being used. In an academic research setting, the overall goals of a data audit remain the same:

  1. Determine where the data are. In many cases, this is a simple question to answer. If a collaborator sends you a single CSV file with their data, you probably have a good idea where that data is, but only if the data are complete, which brings us to our next goal.

  2. Determine if the data are complete. Studies, particularly in the social or biomedical sciences and particularly when dealing with human subjects, have extensive study design documentation (this is almost always a requirement for getting ethics approval for human subjects studies). This documentation tells you, the auditor, what should be in the data you have been pointed to.

  3. Determine if the data can be used for their specified purpose. In most studies, data will be analyzed, and this typically requires them to be formatted in a particular way. If, for example, the study collected free-form responses as a collection of .txt documents, this is less amenable to quantitative analyses than if those free-form responses were collected into a single tabular data file.

  4. Determine if the data follow good data management practices. It is one thing to identify where the data are and whether the data are complete. In some cases, that portion of the data audit can be scripted. It is another thing entirely to determine whether the data follow good data management practices, or which data management principles the data structure violates.

The end goal of any audit is not to restructure the dataset. I want to repeat that: you, as the auditor, should not be changing how the data are managed. This even applies to heads of labs who want to perform their own data audit. If you change a data structure without full buy-in from the rest of the team, you will cause problems and might even make the data structure worse. Refactoring data is a distinct process, albeit one that is informed by the results of a data audit. The end goal of a data audit is the data audit report.


The Data Audit Report

A data audit report is a human-readable document that describes the results of the data audit, identifies issues, and suggests a set of solutions. This is not scholarly work, and it should be written as straightforwardly as possible. This is not a trivial requirement: many of you who have been asked to perform, or have planned, a data audit likely have more computer science/data management experience than your colleagues, and if you are not careful, you might use more technical terminology than is useful. Remember, the goal of a data audit is not to create a document for you to reference (though this is a major advantage); it is to create a document that anybody can use to understand the dataset in question. Take for example the following scenario:

Scenario:

In performing a data audit of a longitudinal study, you find that the data from multiple timepoints are stored in wide format .SAV files. This makes them difficult to access using open source data analysis tools, and the wide format of the data makes it difficult to perform longitudinal modeling. You want to propose converting the master copy of the dataset to long format, writing a script that, when run, will produce a wide format datafile, and changing the file type to a common delimited file type, like a CSV. In your report you write:

Solution:

Convert wide to long, create reverse conversion script in R, change file format to CSV.

This is informative language, and if you handed me a report with that as a solution, I would be able to understand it. But that requires knowledge of wide/long formats and why one would use them, why one would create a reverse conversion script rather than simply create an additional copy of the dataset, and why CSV is better than SAV as a file format. The solution to these issues is to divide the description of a solution from the implementation of said solution, and to add rationale to the solution:

Solution:

First, the dataset needs to be converted from wide format (rows are subjects, columns are variable/timepoint combinations) to long format (rows are subject/timepoint combinations, with time-varying variables stored in a single variable-name column and a single value column), which would improve the ability of analysts to run longitudinal models on the dataset. However, as wide format is useful for computing summary statistics, a script needs to be created that will take the long format dataset and convert it over to a wide format dataset whenever necessary. The long format dataset acts as the immutable raw data, and the wide format dataset can be reconstructed whenever necessary. Finally, the long raw datafile should be stored in a delimited text format, such as a .csv, and accompanied by a JSON codebook.

Implementation Details:

  • Conversion from wide to long in R (reshape/melt+cast)

  • Conversion script written as “sourceable” in R, hard coded to take long data

  • Conversion to CSV one-time non-automated via R and the foreign package

  • Codebook generated using R, filled in manually.

As you can see, while there is more writing, there are far more details, and the proposed solution can be evaluated by a non-technical researcher. The implementation details act as a guide for a technical researcher, with the aim of providing enough information that any reasonably experienced data manager could carry them out.
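To make the split between solution and implementation concrete, here is a minimal sketch of what the wide-to-long conversion might look like in R. It uses tidyr rather than the reshape/melt+cast route listed above, and the column names (subject_id, anxiety_t1, and so on) are hypothetical placeholders, not names from any real dataset.

  # Hypothetical wide file: one row per subject, columns like anxiety_t1, anxiety_t2, ...
  library(tidyr)
  library(readr)

  wide <- read_csv("behav_wide.csv")

  # Long format: one row per subject/variable/timepoint combination
  long <- pivot_longer(
    wide,
    cols      = -subject_id,
    names_to  = c("variable", "timepoint"),
    names_sep = "_",
    values_to = "value"
  )
  write_csv(long, "behav_long.csv")   # this copy becomes the immutable raw data

  # Reverse conversion, run whenever a wide file is needed for summary statistics
  wide_again <- pivot_wider(
    long,
    names_from  = c(variable, timepoint),
    values_from = value,
    names_sep   = "_"
  )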


How to Write a Data Audit Report

I have a certain structure I like to use when I perform a data audit. Broadly, it is broken into three main sections:

Summary of the Project

This is a high-level summary of the project, and it is mainly included so that future readers can understand the context of the dataset itself. If, for example, the dataset in question is from a large longitudinal neuroimaging study, this summary would describe what that study was about, and also describe the relevant aspects of the study design. For example, if this neuroimaging dataset contained 4 tasks, the relevant information is what those tasks are called, how many individual runs of each task there are in a given dataset, and any aspect of the task that might lead to uncommon datatypes (e.g., was physiology collected during a given task?). It would not be useful to include scientific information about the study design in this summary. From a data management perspective, it makes no difference if one task is an inhibitory control task and the other is a working memory task. That being said, this summary should point out where the actual study design documents are, so that the scientific information is accessible.

Data Locations

In the report, this section provides a high-level overview of where all the data are. A machine-readable file, preferably a spreadsheet, needs to be generated that contains a comprehensive list of files and a summary of their content, but this does not need to be contained in the written report itself.

I like to break this section out into meaningful divisions. For example, if you were auditing a study that had both baseline self report measures and ecological momentary assessment data, I would divide up my data locations into those two categories. Again, I wouldn't structure this section on the basis of scientific similarity, e.g. Anxiety Measures (self report, EMA). This is because, presumably, the divisions you do come up with will be internally similar in terms of data format, which is the relevant aspect for data management.
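Generating the underlying file inventory itself is usually scriptable. As a minimal sketch in R (the study_data/ root and the column choices here are hypothetical, not a standard):

  # Walk the (hypothetical) data directory and record basic facts about every file
  files <- list.files("study_data", recursive = TRUE, full.names = TRUE)
  info  <- file.info(files)

  inventory <- data.frame(
    path     = files,
    type     = tools::file_ext(files),
    size_kb  = round(info$size / 1024, 1),
    modified = info$mtime
  )

  # A summary-of-contents column still has to be filled in by hand
  inventory$contents <- ""
  write.csv(inventory, "audit_file_inventory.csv", row.names = FALSE)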

Data Completeness

This is a checklist of every aspect of the data that you expected to be present. There are two ways I like to identify what data are expected to be present. First, I look at the design documents, usually an IRB protocol or a grant application. These list all types of data collected, but don't necessarily describe the data format. Next, I talk to the PIs, lab managers, and the RAs who run the study data collection itself. This is always an enlightening exercise, as there is usually a disconnect between what the PIs think has been collected (with respect to format) and what is actually collected and stored. If an aspect of the data is not present at all, then that needs to be noted. If data are missing for a subset of subjects, then that needs to be noted as well (this is not referring to statistical missingness; rather, it refers to files or records that are absent from where the dataset is stored).
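Part of this completeness check can often be scripted once you know what the design documents promise. A minimal sketch in R, assuming a hypothetical study with 50 subjects and one behavioral file per subject named like sub-01_behav.csv:

  expected_ids <- sprintf("sub-%02d", 1:50)   # taken from the design documents

  found_files <- list.files("study_data/behavioral", pattern = "_behav\\.csv$")
  found_ids   <- sub("_behav\\.csv$", "", found_files)

  missing    <- setdiff(expected_ids, found_ids)  # promised, but not on disk
  unexpected <- setdiff(found_ids, expected_ids)  # on disk, but not in the design docs

  cat("Missing files:", paste(missing, collapse = ", "), "\n")
  cat("Unexpected files:", paste(unexpected, collapse = ", "), "\n")

Anything that can't be checked this way (say, whether a scanned consent form is legible) still has to be checked by hand and noted in the report.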

Issues and Solutions

This is a list of issues that arose during the audit, and proposed solutions. This should be as specific as possible, with screenshots and references as needed. It should be immediately apparent upon reading an issue a) what the auditor thinks the issue is and b) that the evidence overwhelmingly points to that issue being a real concern.

I break issues down into red flags and yellow flags. Red flag issues are serious data integrity problems: e.g., a survey is not where it is expected to be, some aspect of the chain of data custody has been broken, neuroimaging files are in an unusable format, and so on. There is no question that these problems need to be fixed right away, or at the very least brought to somebody's attention. Unfortunately, these are the issues that are usually the hardest to solve. For example, in a recent dataset I was working on, due to a series of drive failures on a workstation used to process neuroimaging data, all the neuroimaging data from that dataset were wiped clean. Fortunately we had backups, but we had only backed up the raw data and not the processed data that had taken a previous postdoc several months to produce. We only lost time, rather than losing data, but it was still problematic. As nobody had been looking at this dataset since the previous postdoc left, I was the one to detect this problem during my audit.

Yellow flag issues are a bit more of a touchy subject. These are issues that you have identified as sub-optimal. The problem with raising them, though, is that they are often due to the well-meaning practices of the people who collected the data and have worked with the data for years. You are effectively telling the PI, lab manager, and RAs: "In my opinion, you did this wrong, here is a better way of doing it." Well, quite frankly, most folks in academia don't appreciate that sort of thing, and so it pays to be, for lack of a better word, politic when raising these yellow flag issues. Here's an example I've encountered a number of times:

SPSS is a commonly used statistical software package. I won't fault it, it does what it says on the tin, but I personally cannot stand using it. The reason I cannot stand using it is that its native file storage format, the .SAV file, has a "proprietary" structure. These files can be opened in SPSS, but opening them in other software, like R, requires additional packages. More to the point, I cannot open a .SAV file in a text editor. I like files that can be opened in a text editor, if at all possible. It makes it so much quicker to look for problems, or to get an understanding of how a dataset is structured. I also make an effort to only use open source tools, so I don't actually have a copy of SPSS installed anywhere.

Now, anybody working in psychological research will have encountered these files. For me, storing data in a .SAV (or a .mat, or any other proprietary format) is a big yellow flag issue. But I guarantee you that telling your PI they need to stop using SPSS and switch over to a simple file format like .csv will not go over as well as you might think. Yes, if they made the switch YOU would work faster, because presumably you are interested in automating all of your data management processes. But if everybody else is working with SPSS, then they are just not going to want to make that switch suddenly. So instead of making that very harsh suggestion, I would approach it like so:

  1. Note the concern, and describe it: .SAV files are difficult to work with using most open source scripting languages.

  2. Lay out the long term solution: In the long term, .SAV files should be converted to .csv files, and item metadata stored as .json codebooks. 

  3. Suggest a shorter-term improvement: In the meantime, all .SAV files should have their names standardized (e.g., behav_ses-01_parent.sav, behav_ses-01_child.sav), and all variable names should have a standardized structure.

  4. Note the advantages of this shorter term fix: Standardization would decrease analysis time and provide guarantees with respect to linking variables (variables that link cases across multiple datasets). 

Foremost in your mind should be: how would this change in data structure improve the experience of everybody who will work with these data in the future, not just me? If you are performing a data audit, you are likely the most experienced data manager in the room, so these issues are things you know how to deal with on the fly. Your job is to smooth these issues over, so that less experienced analysts don't get caught up on them.

Finally, I personally like to highlight things I liked about a dataset, green flags. I believe that you can’t really learn what is good practice if nobody points out what was done well, so I try to point out cases where I don’t see an issue in how the data is stored. Strictly speaking, this is not a requirement, but I’ve found it to be helpful in my own learning.


Closing Thoughts

So let's return to the question: why perform a data audit? A good data audit produces a document that can be used to a) reference the dataset as it currently exists and b) guide a data refactor. The former is useful to anybody currently working with the dataset; the latter to anybody who might take on the task of actually improving how the data are stored. A data audit, in my view, is a useful service to your colleagues in the lab or your collaborators. A well documented dataset is easier to work with than a poorly documented one, and a well structured and documented dataset is even better.

Eight Principles of Good Data Management

This post is the second in a series of posts about data management practices (see the introduction here). Before I get into my principles of good data management, I want to say that I found out, after my previous post, that librarians, and library science as a field, have been thinking and writing about data management for years now. Many university libraries have programs/seminars where the librarians will consult with you about data management. This is a wonderful resource, and, if my own experience is any indication, a very underutilized one. So, if you are at a university, check out your library system!


In today's post, I walk through 8 principles of good data management. These are wholly informed by my own experiences with data management and analysis, and I wrote them with the following in mind: when it comes to data management, you are your own worst enemy. I've lost count of the times I've started complaining about some aspect of a dataset, checked who made the relevant change, and found out it was me a week ago. So these are principles that I try to follow (and oftentimes don't quite manage to) to protect my work against mistakes, to save time, and to make it easier to collaborate with others. This is, of course, not an exhaustive list, nor likely the absolute best way of managing data, but I've found them to be helpful personally.

Document Everything

This principle should be fairly self-explanatory. There should be well formatted documentation for every bit of data in a dataset. Seems simple enough, but in practice documentation tends to fall by the wayside. After all, you know what type of data you collected, right? But good documentation makes everyone's work so much simpler, so what makes for a good set of documentation? For me, good documentation follows these guidelines:

  • It references files/variable names. A codebook is documentation, but we can think of any sort of data documentation as codebook-esque. It is not useful to simply say: “The UPPS was collected.” I need to know that the UPPS is in the behav_baseline.csv, and is labeled along the lines of “upps_i1_ss” (UPPS, item 1, sensation seeking subscale).

  • It's complete. First, there shouldn't be any bit of data that is not described by some sort of document. If you have, for example, screenshots of behavioral test results as your raw data (a real thing: the test in question was proprietary, and the only way of storing a record of the score the participant got was to take a screenshot of the reporting page!), then these files need to be described, even though all the test results have presumably been transcribed into a machine-readable file.

This also holds for item text (and response option text), descriptions of biological assays, and even neuroimaging acquisition details. The codebook should contain all the information about how the data were collected in an easily human-readable format. The time an analyst spends hunting through Qualtrics surveys or reading through scanner protocol notes is time they do not spend actually analyzing your data.

  • It’s cross-referenced: For the love of science, please put in the correct references when you describe a measure in your codebook! This makes it much easier to prepare an analysis for publication. Additionally, if certain measures are repeated, or have different varieties (child, parent, teacher version), make sure that these are all linked together. The goal here is to make it easy for an analyst to understand the structure of the data with respect to the analysis they are trying to perform, and to make it easier for them to prepare that analysis for eventual publication.

  • It’s thorough: This is not the same as being complete. Thoroughness is more about how aspects of the data are described. Technically the following is a complete description:

    • UPPS Child Scale (Zapolski et al., 2011)

But it doesn’t tell you anything about that measure. A more thorough description would be: 

  • UPPS-P Child Scale (Zapolski et al., 2011): Self-report impulsivity scale with 4 subscales: negative urgency (8 items), perseverance (8 items), premeditation (8 items), sensation seeking (8 items). Items are on a 1-4 Likert scale. The child version is a modified version of the adult UPPS (Whiteside & Lynam, 2001).

This description tells me what the scale is, what it has in it, and what to expect about the items themselves. It's also cross-referenced, with historical information. It doesn't go into the meaning of each subscale (that wouldn't be within the scope of a codebook), but it provides meaningful information for any analyst.

  • It’s well indexed: Give us a table of contents at the very least. I don’t want to have to flip through the codebook to find exactly what I need. The ability to look for, say, baseline child self report measures, and see that they start on page 20, just makes my job much easier. 

  • It describes any deviations from the expected: Say you modified a scale from a 1-5 Likert to a 1-7. That needs to be noted in the documentation, or else it can cause big issues down the line. On the other hand, if you used a scale as published, you just need to provide the minimal (but thorough!) description.

When writing a codebook, one needs to remember: you are not writing this for yourself. You are writing it for somebody who has never seen this data before (which also applies to you, 2 weeks after you last looked at the data). What do they need to know?


Chain of Processing

Very few data management issues are worse than not knowing what somebody did to a piece of data. It renders the data effectively unusable. If I don't know how an fMRI image was processed, or how a scale was standardized, I cannot use it in an analysis. On the other hand, if I have documentation as to what happened, who did it, and why, I can likely recover the raw form, or, at the very least, evaluate what was done.

This principle is obviously inspired by the idea of a "chain of custody" in criminal investigations. My (admittedly lay-person) understanding of this principle is that for evidence to be considered in a trial, there needs to be a clear record of what was done to it and by whom, from the moment the piece of evidence was collected to the moment the trial concludes. This protects everybody involved, from the police (from accusations of mishandling) to the accused (from actual mishandling). Similarly, this idea applied to data management protects both the analyst and the analysis at hand.

Describing the chain of processing can be done in multiple ways. I am in favor of a combined scripting/chain of processing approach, where I write processing scripts that take raw data, process it, and return either data ready to be analyzed or the results of an analysis themselves. In this case, the script itself shows the chain of processing, and anybody who looks at it will be able to understand what was done in a given case (if I've written my code to be readable, which is always a dicey proposition). Another way is the use of changelogs. These are text files (or an equivalent machine- and human-readable format, like JSON) that analysts can use to note when they make changes to any aspect of the data. Sometimes changes need to be made by hand, like when the data require cleaning by hand (e.g., psychophysiology data), and the changelog would need to be manually updated. Other times these changelogs can be created by the scripts used to process the data.

This is such an important principle to follow that I will say, I would prefer a badly written changelog or a hard to read script to no chain of processing at all.  
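If a changelog feels like a burden, even a tiny helper function goes a long way, because processing scripts can call it automatically. A minimal sketch in R (the CHANGELOG.txt location and entry format are just one reasonable choice, not a standard):

  log_change <- function(description,
                         who     = Sys.info()[["user"]],
                         logfile = "CHANGELOG.txt") {
    entry <- sprintf("%s | %s | %s",
                     format(Sys.time(), "%Y-%m-%d %H:%M"), who, description)
    cat(entry, "\n", file = logfile, sep = "", append = TRUE)  # one line per change
  }

  # Called from inside a processing script, or by hand after manual cleaning
  log_change("Re-exported baseline survey from Qualtrics; no item changes")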


Immutable Raw / Deletable Derivatives

Imagine the case where a scale has been standardized. This is a fairly common procedure: subtracting the mean and dividing by the standard deviation. It makes variables that are on different scales comparable. Now imagine that you only have the standardized scale, and no longer have the raw data. This is a huge issue. Because you do not know what values were used to standardize the scale, you wouldn't be able to add any more observations.

Okay, so that might be a trivial example. Let me mention an example that I've encountered many times: the processed neuroimaging files are available, but the raw images are not. This is usually not because the raw data were deleted, though that has happened in my experience.

If you don't have the truly raw data, you cannot recover from any data management mistakes. This means that your raw data is precious. You might never analyze the truly raw data, I know I don't, but it needs to be kept safe from deletion or modification. Ideally, your dataset can be divided into two broad sections. The first is the rawest data possible: images right off the scanner, a .csv downloaded straight from Qualtrics. Label it well, document it, and then set it to read-only and never touch it again. The second half is your data derivatives. These are any bits of data that have undergone manipulation. If you have pulled out scales from that raw dataset, that is a derivative. If you have cleaned your raw physio data, derivative. Because you are presumably following the previous principle, Chain of Processing, you know precisely how each derivative bit of data was created. As such, your derivatives are safely deletable. Now, it might not be wise to delete some derivatives. For example, if your physiological data were cleaned by hand (as much of it is), then even if you know exactly how it was cleaned, the time and effort involved mean you likely shouldn't delete those derivative files. But if push came to shove, and those files were deleted, you would be able to recover any work you had done.

I delete derivatives all the time, because my workflow involves writing data processing scripts. In fact, I often don't even produce actual derivative files, instead keeping any processing in memory so that when I go to run an analysis, I reprocess the data each time. Whatever way you do it, make sure your raw data is immutable, read-only, and backed up in multiple locations, and that you have a chain of processing to tell you how each derivative file was created. If both of those are in place, you can rest much easier when thinking about how your data is stored.
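Marking the raw half of a dataset read-only can itself be scripted. A minimal sketch in R, assuming a hypothetical data/raw directory (note that Sys.chmod only fully applies on Unix-like systems; on Windows it mostly toggles the read-only flag):

  # Make every file under the (hypothetical) raw directory read-only
  raw_files <- list.files("data/raw", recursive = TRUE, full.names = TRUE)
  Sys.chmod(raw_files, mode = "0444")   # read-only for owner, group, and others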


Automate or Validate

I've mentioned scripts several times so far, so it is no surprise that scripting is one of my principles of good data management. This principle says: if you can automate a part of your processing, do that. However, oftentimes one can't automate fully. In those cases, write scripts to validate the results of the non-automated processing. By validation I don't mean checks to ensure the processing was done correctly; I mean checks to make sure that your files are stored correctly and that you don't have any file format differences.

Why write scripts/validators? Because you are a weak, tired human being (presumably?). You make mistakes, you make typos, and you forget. Sure, you are clever, and can think of unique solutions to various data issues, but a solution is only useful when applied consistently. A computer, on the other hand, does only exactly what it is told to do, nothing more and nothing less. Take advantage of that quality! A script that performs your data processing will process the same way each time, and acts as a record of what was done. But what about mistakes? Yes, you will make mistakes in your code, mistakes you don't catch until later. I develop a processing pipeline for neuroimaging data. In an early development build, I didn't add a "-" in front of a single variable in a single equation. This led to an inversion of the frequency domain filter I was implementing, so instead of removing frequencies outside of 0.001-0.1 Hz, it removed frequencies within 0.001-0.1 Hz. Fortunately, this was simple to detect when I was testing the function, and a couple of hours of tearing my hair out looking at my code for errors led me to find the issue and correct it.

Contrast this with an experience a colleague had with their data. They were doing a fairly simple linear regression, and needed to merge data from two spreadsheets. Each spreadsheet looked identically ordered with respect to subject ID, so they copied and pasted the columns from one into the target spreadsheet. I've done this, we all have, I don't fault them for it. We really shouldn't be doing by-hand merges though. As my colleague realized the night before they were going to submit the manuscript, the first dataset was in fact only ordered the same way for the first 100 or so observations. Then there was a subject ID that was not in the second dataset. So, after the copy and paste, the first 100 observations were accurately matched, but after that, all the observations were offset by one row. Visually, this wasn't apparent until you scrolled to the bottom of a very long dataset, as there were no missing rows (which would have flagged the mismatch quite quickly). Statistically, this is effectively equivalent to randomly reordering your variables column by column. A Very Bad Thing™. No wonder they had some very counterintuitive results! I am glad they found the issue before submitting that manuscript for publication, because had it been published with that mistake, it would have needed to be retracted!

So what happened? Well, my colleague did a very innocuous, commonly done bit of data processing, in a way they had done 100 times before. Just that this time it led, really through no fault of their own other than the same momentary lapse of attention that afflicts most of us from time to time, to a retraction-worthy mistake. A retraction-worthy mistake that was nearly undetectable, and was only found because my colleague happened to scroll to the bottom of the spreadsheet while looking at some unrelated aspect of the data.

Would scripting the data merge have avoided this? Categorically yes. There are ways of messing up data merges when scripting, many ways, but in my experience those become apparent very quickly. This particular issue (a single additional observation in the first dataset) would have been completely avoided by scripting the merge. The scripted solution would also be more robust to other potential issues; for example, what if the ordering of the observations were completely different? Well, if you script the merge, you don't even need to worry about that; the software will take care of it.
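For reference, a scripted, keyed merge is only a couple of lines. Here is a minimal sketch in R, with hypothetical file and column names:

  demo   <- read.csv("demographics.csv")
  scales <- read.csv("self_report_scales.csv")

  # merge() matches rows on subject_id regardless of row order, so nothing is
  # silently offset; all = FALSE drops subjects not present in both files
  merged <- merge(demo, scales, by = "subject_id", all = FALSE)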

Validators are quite useful too, though I will admit I haven't used/encountered many of them. The one I do use is the Brain Imaging Data Structure (BIDS) validator. BIDS is a standardized way of storing neuroimaging data (and an inspiration for this series of blog posts!), and the validator simply looks at the data structure to see if everything is where it needs to be. It flags files that have typos, and identifies where metadata needs to be added. Another validator I've written checks that file names are structured the same for datasets of psychophysiological data, which require cleaning by hand. That cleaning leads to typos in file names, because RAs need to click Save As and type out the name of the file. So, I run this validator before I do any additional data kludging, just so I know my scripts are going to get all the data I was sent. Which is a great segue into my next principle: Guarantees.
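As a sketch of what that second, home-grown validator might look like, here is a minimal version in R, assuming a hypothetical naming convention like s001_upps_rewarded.csv (subject, task, condition):

  files <- list.files("cleaned_physio", pattern = "\\.csv$")

  # Subject is "s" plus three digits; task and condition are lower-case words
  convention <- "^s[0-9]{3}_[a-z]+_[a-z]+\\.csv$"
  violations <- files[!grepl(convention, files)]

  if (length(violations) > 0) {
    warning("Files violating the naming convention:\n",
            paste(violations, collapse = "\n"))
  }

A check like this is also one way of verifying the guarantees discussed next.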


Guarantees

What are guarantees in the context of data management? Guarantees are simple: if I am looking at a dataset, I should be guaranteed that, if there is a certain file naming structure, or file format, etc., all relevant files follow that rule. Not most, not almost all, not all almost surely, but absolutely all relevant files. One way of guaranteeing your Guarantees is to use scripts to process your data. Guarantees are all about consistency, and nothing is more consistent than a computer running a data processing script. Validators are a way of verifying that your guarantee holds.

But why bother? Why does it matter if the variable names are not consistent between scales? Or that mid-study, the behavioral task output filename convention changed? Well, if I were doing analyses with a calculator (like the stats nerd I am), I would be able to adjust for small deviations. But I'm not going to do that (still a nerd), I write analysis scripts. And again, computers only ever do precisely what they are told to do. Guarantees are a way of simplifying the writing of new analysis scripts, or even new processing scripts. Here is an example: consider two variable names, "UPPS_i1_ss_p" and "Conners_p_1". I do quite a bit of text processing to create metadata when running analyses, and in this case, I might want to pull out item-level means for every item in the UPPS and the Conners. But if I do a string split on "_" and look for the item number in the second slot, well, in the UPPS the second slot is "i1", but in the Conners it is "p". I would have to make a modified version of my UPPS processing code to fit the Conners.

But what if my variables were guaranteed to have the following formatting?

“scale_i#_subscale_source” (with an indicator if there is no subscale). 

Then I can write a single script that pulls the necessary information from each variable name, and apply it to every scale in my dataset. It makes programming analyses much simpler, and reduces the need to check the codebook for every new scale.   
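As a minimal sketch of what that single script might look like in R (the variable names here are hypothetical, but follow the guaranteed structure):

  vars <- c("upps_i1_ss_p", "upps_i2_ss_p", "conners_i1_inatt_p")

  parts <- strsplit(vars, "_")        # every name splits into the same four slots
  var_meta <- data.frame(
    variable = vars,
    scale    = sapply(parts, `[`, 1),
    item     = sapply(parts, `[`, 2),
    subscale = sapply(parts, `[`, 3),
    source   = sapply(parts, `[`, 4)
  )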

The main benefit of guarantees is that they reduce the cognitive load on the analyst. If I know that the file names have a standard structure, and that they were script-generated, I can immediately relax and not be on the lookout for any deviations in file naming that might mess up my processing stream. Because of this, I can better focus on performing whatever processing I'm doing correctly. In my experience, when one has to adapt code to little idiosyncrasies in the data structure, these adaptations are where the mistakes creep in. I've never written cleaner code than when I work with guaranteed data.


Open (Lab) Access

Science is fundamentally collaborative. Even if you are the main statistical analyst on a project, you will still need to access data that was generated or collected by other lab members. This brings up an annoying issue, that of data permissioning. There are two ways I’ve seen this issue come up. 

The first is a literal file permission problem. I work on a variety of neuroimaging projects, and, for a variety of historical reasons, neuroimaging data is usually processed on Linux workstations. One aspect of Linux, and particularly Linux on shared workstations, is that each file has a set of access permissions. In Linux, these permissions are the following: 1) can you, the file creator, read/write/execute the file? 2) can members of your user group read/write/execute the file? And 3) can anybody else read/write/execute the file? If you happen to be using a personal Linux machine (more power to you, I guess?), this is not an issue, as you can be fairly certain that the only person accessing files on your computer is you. But on a workstation this can become an issue, because if the permissions aren’t set correctly, other members of your lab won’t be able to access any files you have created. In neuroimaging this quickly becomes problematic, as each step in preprocessing creates temporary files. About 70% of issues I have encountered using various pipelines I have developed have ultimately come down to permission issues. 

Of course, fixing actual file permissions is a fairly simple thing to do. But there is a more problematic "permissions" issue that often occurs in larger labs. I like to refer to this as lab balkanization of data. This is when, due to internal lab politics, different bits of data are only accessible to certain investigators. One example of this, which I have personally experienced, is where I had access to the neuroimaging data from a large longitudinal study, but the self report data was not only inaccessible to me, it was held by an investigator at a university halfway across the country. To get access to those data, we had to request specific cases/variables, and this investigator would then send us just those records.

Now, before I start criticizing this practice, I will note that this sort of data access issue can happen for very good reasons (as was the case with the large longitudinal study). Oftentimes, if there is very sensitive information (think substance use self report in children), that data has additional security requirements. A good example of this is the ADDHealth national survey. This is a very large national survey which collected health behavior data on high schoolers, and one unique aspect of it is that there is social network information for many of the participants. Additionally, ADDHealth collected romantic partner data for participants, including whether two participants were romantic partners. This, combined with the fact that one could, theoretically, easily identify specific participants based on the extensive self report data (deductive disclosure), means that this data needs to be specially protected. To access the romantic partner data, an investigator needs to dedicate an entire room that only they can access (not anybody in their lab, just the actual investigator), with a computer that has been chained to the room and has no internet access. There are a number of other requirements, but the one that made me laugh a bit is that if you were to store the data on a solid state drive (SSD), you are required to physically destroy the drive at the end of the analysis. So there are a number of cases where restricting access to sensitive data is quite reasonable.

That being said, I believe that a PI should make every effort to ensure equal access to all data for all analysts in their lab. This smooths the working process, and reduces mistakes due to miscommunication. When I am looking for data, I usually know exactly what I need, but I might not know exactly what form it takes. If I have access to all the data for a given study, I can hunt for what I need. If I have to ask another person to send me the data, I usually will have to go back and forth a couple times to get exactly what I need. 

So what are the reasons that this balkanization happens? Usually, there is no reason. Somebody just ran a processing script on their own desktop, and never put the file in a shared drive. Occasionally, balkanization can be subtly encouraged by a competitive lab culture. Grad students might keep "their" data close to the chest because they worry that somebody might scoop their results. I'll be blunt: scooping within a lab should be impossible. If two lab members get into this sort of conflict, the PI is either ill-informed about what is happening in the lab (they didn't nip it in the bud, or come down hard on whoever was trying to scoop), or malevolent (they encouraged this behavior in the extremely misguided belief that competition, red in tooth and claw, makes for better scientists). It categorically does not. This balkanization can also occur at the investigator level, for example when two labs that have collaborated on a larger study divide the data between themselves. Personally, I find this to be ridiculous, as again, any concerns about who gets to publish what paper should be dealt with through dialogue, not by restricting access. But, admittedly, when this sort of divide happens, it is rarely resolved in the fashion I prefer (data pooled and everybody has equal access), simply due to investigator inertia/ego.

To avoid issues with data access, data storage plans should be drawn up before the first subject is collected. These plans should indicate if there are any aspects of the data that are deemed sensitive and would require secure storage. Beyond that, these plans should work to provide as equal and as full access as possible to any lab member who would, reasonably, be performing analyses. Who gets to write/publish a certain project should be negotiated openly and clearly. If this kind of transparency is encouraged, then questions about who has access to data quickly become irrelevant.


Redundant Metadata

Metadata refers to data about data. A good example of this is scanner parameters for neuroimaging data. The data itself is the scan, while the metadata are the acquisition settings for that scan. In neuroimaging these are vital to know, as they tell you, among many other things, how fast the data were collected, what direction the scan was in, what the actual dimensions of the image are, and so on. In a more traditional self report survey, metadata could be the actual text of each item, along with the text of the response options.

For multiple file datasets, such as ones where there are separate data files for each subject, a piece of metadata would be which subject is associated with each file. Metadata is obviously important, but oftentimes it is only stored in a single place at a time. Take this simple example, considering two directory structures:

/data/sub-01/behavioral/baseline.csv

/data/sub-01/behavioral/sub-01_behavioral_baseline.csv

Both directory/file combinations contain the same information: the file is the baseline behavioral data for subject 01. But the second combination has redundant information. Not only does the directory structure tell you that this is the behavioral data for subject 01, the file name itself reiterates this. Why is this useful? Well, say you want to analyze all the baseline behavioral data. You extract all the baseline data into a new directory. In the first case:

/newdir/baseline.csv

/newdir/baseline(1).csv

/newdir/baseline(2).csv

In the second:

/newdir/sub-01_behavioral_baseline.csv

/newdir/sub-02_behavioral_baseline.csv

/newdir/sub-03_behavioral_baseline.csv

In the first case, you've lost all identifying information, while in the second case, the important metadata is carried along in the file name. I know which case I would prefer to work with! While this scenario is a bit of a straw man, it does happen. I've seen it in neuroimaging datasets, where the subject is indicated only at the directory level, a la /sub-01/anat/mprage.nii.gz. In fact, this is/was a fairly common data structure, as certain neuroimaging software packages effectively incentivized it.

Metadata is tricky, because there is usually so much of it and you usually don’t know every piece you might need. So, store it all!


Standardization

So, you’ve decided to implement all the principles so far, and you’ve convinced your colleagues to implement good data management practices as well. Wonderful! Your analysis pipelines are humming along, your graduate students and postdocs have slightly less haunted looks in their eyes, and you feel that warm feeling that only well organized data can give you (no? Only me then?). 

With your new found confidence in your data, you decide to strike up a collaboration with a colleague. That colleague has also jumped onto the data management train, so you are confident that when they send you data it will be well organized and easy to use. 

So they send you their data! And it is beautifully organized! Beautifully organized in a completely different way than your data!

Well, now all of your data processing/analysis scripts will need to be rewritten. This might be much easier than normal, because of how well your colleague's data are organized, but it still takes time. So, how can we streamline this?

Now we come to the final principle, to which all other data management principles lead: Standardization. On the face of it, this principle is fairly simple. Labs working with similar study designs/datasets should use a single standard data management setup, rather than many different setups, no matter how well managed each of those setups might be. However, this is much easier said than done.

Different labs/projects/PIs have different data requirements and likely use differing software tools. This leads to a proliferation of data management choices that makes a single standardized data management schema that works for every possible case nearly impossible. The closest thing I have seen to complete standardization is the (previously mentioned) BIDS format, which is only possible because there are a limited number of data sources for neuroimaging data and a great deal of effort has gone into standardizing the low-level file formats used in neuroimaging (e.g. NIfTI files as a standard MRI storage format).

If universal data management standardization is impossible, what can be standardized? I think of datasets as puzzles made up of different modalities of data. Each modality represents a type of data that shares most of its characteristics across studies. For example, I consider questionnaire data to be a single modality, as is fMRI data. Conceivably, a standard format for any questionnaire data could be developed (I would suggest pairing CSV files with a metadata JSON, but there are many other ways as well). I think standardization of different modalities of data is the right way of approaching this problem.
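As one sketch of what that CSV-plus-JSON pairing might look like, here is a hypothetical codebook entry written in R with the jsonlite package; the field names are illustrative, not any formal standard:

  library(jsonlite)

  codebook <- list(
    upps_i1_ss = list(
      scale     = "UPPS-P Child Scale",
      item      = 1,
      subscale  = "Sensation seeking",
      text      = "Item text as administered, copied verbatim",
      values    = "1-4 Likert",
      reference = "Zapolski et al., 2011"
    )
  )

  write_json(codebook, "behav_baseline_codebook.json",
             pretty = TRUE, auto_unbox = TRUE)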

Even with the scope restricted to specific modalities of data, true standardization is difficult. So what can the individual researcher do? Well, first, and most importantly, researchers need to be talking about data management with colleagues and students. There is a tendency for PIs to abstract away from the nitty-gritty of data management and data analysis, and while I understand the reason for that (grants don't write themselves!), this inattention is one of the leading drivers of data rot. By working data management into scientific discussions and project planning, I find the conversation becomes grounded and focused on the question of what we can do with what we have. From there, researchers should explicitly share their data management scheme with colleagues. If you've saved time by implementing good data management, then likely your colleagues would benefit from adopting what you've done. While this can be a bit of work, I've found that by emphasizing the timesaving aspects of good data management, otherwise very busy PIs become much more amenable to changing around the structure of their data storage.

Ideally, as data management is discussed and setups are shared, this would naturally lead to a kind of standardization. Consider a fairly simple type of standardization, a well structured variable naming scheme: scale_i#_subscale_source. Changing variable names in a dataset tends to be very simple, and once researchers see how useful a standardized naming scheme can be, it can be quickly adopted. The key here is for researchers who are trying to bring good data management practices to the table to keep up the pressure. Researchers/scientists/academics, myself included, tend to have quite a bit of inertia in how we like to do things. We get into ruts, where the way we know how to do a task is so familiar and easy that we continue out of convenience. But the siren call of "it will save you time" is strong, and I've gotten the best results when pitching standardization by emphasizing the advantages rather than pointing out what is going wrong.


Summary 

The above data management principles were derived from my own experiences working with all kinds of data, and they are not meant to be exhaustive or overly rigid. My goals when thinking about data management are: how do I protect my work from my greatest enemy, me from a week ago, and how do I save time and cognitive energy? I don't like wasting time, and I don't like to repeat work. That being said, all of these principles are well and good when you are setting up a new study, but what if you are currently working with a dataset? Or you are a new graduate student or postdoc, and you are being handed a dataset? You might want to start reorganizing the data immediately into a better management structure. I would urge caution though. Not only do PIs tend not to like their datasets being unilaterally reorganized by a new member of the lab (a scenario that I know nothing about, nothing at all), you also likely don't know enough about the dataset in question to even begin to reorganize it. In order to efficiently and correctly reorganize the data, you need to understand what you have. You need to perform a data audit, which is a systematic investigation of an existing dataset for the purposes of identifying:

  1. What should be in it. 

  2. What is actually in it. 

  3. How the data is currently organized.

  4. How it could be organized better.

In my next data management post, I’ll be walking through how I perform a data audit, and what I think needs to be in one. Thanks for reading!

Data Management for Researchers: Three Tales

This is part one of a series of blog posts about best practices in data management for academic research situations. Many of the issues/principles I talk about here are less applicable to large scale tech/industry data analysis pipelines, as data in those contexts tend to be managed by dedicated database engineers/managers, and the data storage setup tends to be fairly different from an academic setting. I also don't touch much on sensitive data issues, where data security is paramount. It goes without saying that if good data management practices conflict with our ethical obligations to protect sensitive data, ethics should win every time.


Good Data Management is Good Research Practice 

Data are everywhere in research. If you are performing any sort of study, in any field of science, you will be generating data files. These files will then be used in your analyses and in your grant applications. Given how important the actual data files are, it is a shame that we as scientists don’t get more training in how to manage our data. While we often have excellent data analysis training, we usually have no training at all in how to organize data, how to build processing streams for our data, and how to curate and document our data so that future researchers can use it. 

However, researchers are not rewarded for being excellent at managing data. Reviewers are never going to comment on how beautifully a dataset is documented, and you will never get back a summary statement from the NIH commenting on the choice of comma-delimited vs. tab-delimited files. Quite frankly, if you do manage your data well, your reward will be the lack of comments. You will know you did well if you can send a new collaborator a single link to your dataset and they respond with, "Everything looks like it is here, and the documentation answered all of my questions." So, given that lack of extrinsic motivation, why should you take the time and effort to learn and practice good data management? Let me illustrate with a few examples of why good data management matters. All of the examples below are issues that arose in real life projects (albeit with details changed both for anonymity and to improve the exposition), and each one of them could have been prevented with better data management.


Reverse coded, or was it?

I once had the pleasure of working on a large scale study that involved subjects visiting the lab to take a 2 hour battery of measures. This was a massive collection effort, which fortunately, as is so often the case with statisticians, I got to avoid, coming in only once the data had already been collected. This lab operated primarily in SPSS, which, for those not familiar with it, is a very common statistical analysis software package used throughout the social sciences. For many, many years, SPSS was the primary software that everybody in psychology was trained on, and to its credit, it is quite flexible, has many features, and is easy to use. The reason it is so easy to use, however, is that it is a GUI-based system, where users can specify statistical analyses through a series of dialog boxes. Layered under this is a robust syntax system that users can access; however, this syntax is not a fully featured scripting language like R, and is, to put it mildly, difficult to understand.

In this particular instance, I was handed the full dataset and went about my merry way doing some scale validation. But then I ran into an issue. A set of items on one particular measure were not correlating in the expected direction with the rest of the items. In this particular measure, these items were part of a subscale that was typically reverse coded. The issue was, I couldn't determine if the items had already been reverse coded! There were no notes, and the person who prepared the data couldn't remember what they did and couldn't find any syntax. Originally, I was under the impression that the dataset I was handed was completely raw, but as it turned out, it had passed through 3 different people, all of whom had recomputed summary statistics and scale scores and made other changes to the dataset. Because we couldn't determine if the items were reverse scored, we couldn't use the subscale, and this particular subscale was one of the most important ones in the study (I believe it was listed as one of the ones we were going to analyze in the PI's grant, which meant we had to report those results to the funding agency).

After a solid month of trying everything I could to determine if the items were reverse scored or not, I ran across a folder from a graduate student who had since left the lab. In that folder, I found an SPSS syntax file, which turned out to be the syntax file used to process this specific version of the dataset. However, the only reason I could determine that is because, at the end of the syntax file, the data were output to a file named identically to the one I had.

Fortunately, this story had a happy ending in terms of data analysis, but the journey through the abyss of data management was frustrating. I spent a month (albeit on and off) trying to determine if items were reverse coded or not! That was a great deal of time wasted! Now, many of you might be thinking, why didn’t I go back to the raw data? Well, the truly raw data had disappeared, and the dataset I was working with was considered raw, so verifying against the raw data was impossible. 

What I haven’t mentioned yet, is that this was my very first real data analysis project, and I was very green to the whole data management issue. This was a formative experience for me, and led me to switch entirely over to R from SPSS, in part to avoid this scenario in the future! This situation illustrated several violations of good data management practices (these will be explained in depth in a future post):

  • The data violated the Chain of Processing rule, in that nobody could determine how the dataset I was working with was derived from the original raw data.

  • It violated the Document Everything rule, in that there was no documentation at all, at least not for the dataset itself. The measures were well documented, but here I am referring to how the actual file itself was documented. 

  • The data management practices for that study as a whole violated the Immutable Raw / Deletable Derivatives rule, in that the raw data was changed, and if we had deleted the data file (which was a derivative file) I was working with, we would have lost everything.

  • It partially violated the Open Lab Access rule, in that the processing scripts were accessible to me, but were in the student's personal working directory, rather than saved alongside the datafile itself.

This particular case is an excellent example of data rot. Data rot is what happens when many different people work with a collection of data without a strong set of data management policies put in place. What happens is that, over time, more and more derivative files, scripts, subsets of data files, and even copies of the raw data are created as researchers work on various projects. This is natural, and with good data management policies in place, not a problem overall. But here, data rot led to a very time consuming situation. 

Data rot is the primary enemy that good data management can help to combat, and it is usually the phenomenon that causes the largest, most intractable problems (e.g., nobody can find the original raw data). It is not the only problem that good data management practices can defend against, as we will see in the next vignette.


Inconsistent filenames make Teague a dull boy.

I am often asked to help out with data kludging issues, which is to say I help collaborators and colleagues get their data into a form that they can work with. In one particular instance, I was helping a colleague compute a measure of reliability between two undergraduates' data cleaning efforts. I was given two folders filled with CSV files, a file per subject per data cleaner, and I went about writing a script to match the file names between both folders, and then to compute the necessary reliability statistics. When I went to run my script, it threw a series of errors. As it turns out, sometimes one data cleaner would have cleaned one subject's datafile while the other data cleaner missed that subject, which was expected. So I adjusted my script and ran it again. This time it ran perfectly, and I sent the results over to my colleague. They responded within 20 minutes to say that there were far fewer subjects with reliability statistics than they expected and asked me to double-check my work. I went line by line through my script and responded that, assuming the filenames were consistent between the two data cleaners, my script had picked out all subjects with files present in both cleaners' folders.

Now, some of you readers might be seeing my mistake. I assumed that the filenames were consistent, and like most assumptions I make, it made a fool out of me. Looking at the file names, I found cases that looked like:

s001_upps_rewarded.csv vs. s001_ upps_rewarded_2.csv

Note the space after the first underscore and the _2 in the second file name. These sorts of issues were everywhere in the dataset. To my tired human eyes they were minor enough that, on a cursory examination of the two folders, I missed them (though I did notice several egregious examples and corrected them). But to my computer’s unfailingly rigid eyes, these were not the same file names, and therefore my script didn’t work.
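
To make the failure mode concrete, here is a minimal sketch in R of the kind of matching logic my script used. The folder names (“cleaner_A”, “cleaner_B”) and the pairing loop are hypothetical stand-ins rather than my original code, but the core move, pairing files by exact name, is the same:

# Sketch only; "cleaner_A" and "cleaner_B" are made-up folder names.
files_a <- list.files("cleaner_A", pattern = "\\.csv$")
files_b <- list.files("cleaner_B", pattern = "\\.csv$")

# Only files whose names match exactly in both folders get paired.
paired <- intersect(files_a, files_b)

# "s001_upps_rewarded.csv" and "s001_ upps_rewarded_2.csv" are different
# strings, so that subject silently drops out of `paired`, and no
# reliability statistic is ever computed for them.
for (f in paired) {
  ratings_a <- read.csv(file.path("cleaner_A", f))
  ratings_b <- read.csv(file.path("cleaner_B", f))
  # ... compute the reliability statistic for this subject ...
}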

This happened because, when this particular dataset was collected, the RAs running each session had to manually type in the file name before saving it. Humans are fallible but can adjust for small errors, while computers will do exactly what you tell them to. In my case, there was nothing wrong with the script I wrote; it did exactly what I told it to do: ignore any unpaired files. The issue was that there was no guarantee of consistency in the file names. So what data management principles did this case violate?

  • Absolute Consistency: The files were inconsistently named, which meant my script could not pair them correctly.

  • Automate or Validate: The files were manually named, which means there was no guarantee that they would be named consistently. Additionally, there was no validation tool to detect violations of the naming convention (there is now; I had to write one). A sketch of such a check follows this list.
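
Here is a rough sketch of what a naming-convention check can look like in R. The convention it enforces (sNNN_measure_condition.csv) and the folder name are assumptions for illustration; the validator I actually wrote is specific to that lab’s convention:

# Flag any CSV whose name does not match the expected pattern
# sNNN_measure_condition.csv (pattern and folder are illustrative).
expected <- "^s[0-9]{3}_[a-z]+_[a-z]+\\.csv$"

files <- list.files("cleaner_A", pattern = "\\.csv$")
bad   <- files[!grepl(expected, files)]

if (length(bad) > 0) {
  warning("Files violating the naming convention:\n",
          paste(bad, collapse = "\n"))
}

Run against the example above, "s001_ upps_rewarded_2.csv" would be flagged for both the stray space and the trailing _2.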

Now, this was not a serious case. I didn’t spend a month fixing the issue, nor was it particularly difficult to fix. But I did have to spend several hours manually correcting the file names, and any amount of time spent fixing data management issues is time wasted, because such issues can be preempted, or at the very least minimized, by following good data management principles.


Lift your voices in praise of the sacred text fmri_processing_2007.sh

In addition to behavioral data, much of my work deals with neuroimaging data, and in fact many of the ideas in this post came out of those experiences. Neuroimaging, be it EEG, MRI, fNIRS, or whatever modality gets invented between the time this is posted and the time you read it, produces massive numbers of files. For example, one common way of storing imaging data is the DICOM format. DICOMs are not single files but rather collections of files representing “slices” of a 3D image, along with headers containing a variety of metadata. There might be hundreds of files in a given DICOM, and multiple DICOMs can sit in the same folder. This is not necessarily an issue, as most software can determine which file goes with which image. But now imagine those files, their conversions into a more analysis-friendly format, and the associated behavioral data (usually multiple files per scan, and multiple scans per person), and you get a sense of my main issue with neuroimaging data: it can be stored in a practically infinite number of ways.

When I first started working with neuroimaging data, I was asked to preprocess a collection of raw functional MRI scans. Preprocessing is an important step in neuroimaging because it a) corrects for a variety of artifacts and b) fixes the small issue of people having differently shaped brains (by transforming their brain images into what is known as a standard space). Preprocessing fMRI images involves quite literally thousands of decision points, and I wanted to see how the lab I received the data from had done it. They sent over a shell script titled fmri_processing_2007.sh. The 2007 in the file name was the year it was originally written. This occurred in 2020. The lab I was collaborating with was using a 13-year-old shell script to process their neuroimaging data.

As aghast as I was, I couldn’t change that fact, so I took the time to work out which processing steps were being done, and I set the script running on my local copy of the dataset. It failed almost immediately. I realized I had made the mistake of fixing what I considered issues in the file names and organization, even though I had tried to do so in a way that wouldn’t break the script. After patching the processing script to match, I managed to run it to completion.

Around the same time, I was working with a different neuroimaging group, and they requested processing code to run on their end. I sent over my modified script, as it was the only processing script I had on hand, and I felt I had made it generalizable enough that it should handle most folder structures. I was severely mistaken. My folder structure looked something like this:

/project/
    fMRI/
        s001_gng_1.nii.gz
        …
    MPRAGE/
        s001_mprage.nii.gz
        …

While the other lab’s folder structure looked like:

/project/
    s001/
        fMRI/
            gng_1.nii.gz
        MPRAGE/
            mprage.nii.gz

I had written my script to assume that the first underscore-separated component of a file name was the subject ID, which it was in my data. In the other lab’s data, however, the subject IDs were specified at the folder level. Obviously my script would not work without substantial alteration. I don’t think they ever made those alterations.
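
A stripped-down illustration of the mismatch, using made-up paths, looks something like this. My script effectively did what the first function below does: take the first underscore-separated chunk of the file name as the subject ID.

path_mine   <- "project/fMRI/s001_gng_1.nii.gz"
path_theirs <- "project/s001/fMRI/gng_1.nii.gz"

# My assumption: the subject ID is the first chunk of the file name.
id_from_filename <- function(path) {
  strsplit(basename(path), "_")[[1]][1]
}

id_from_filename(path_mine)    # "s001" -- works for my structure
id_from_filename(path_theirs)  # "gng"  -- silently wrong for theirs

# The other lab's subject IDs live in the directory path instead,
# so the extraction has to look there.
id_from_folder <- function(path) {
  parts <- strsplit(dirname(path), "/")[[1]]
  parts[grepl("^s[0-9]+$", parts)]
}

id_from_folder(path_theirs)    # "s001"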

There are two good data management principles violated here:

  1. Redundant Metadata: In the case of the other lab, the file names did not contain the subject information. What would have happened if those files were ever removed from the subject’s folder? 

  2. Standardization: This is more of a goal than a principle. Imagine if both the other lab and I had used a standardized way of storing our data and written our scripts to fit. We would have been able to pass code back and forth without issue, and that would have saved us both time and trouble.

Neither data rot nor human fallibility was to blame for these issues. In fact, both datasets were extremely consistently organized, and there were no naming mistakes. We simply didn’t use the same data structure, and it is worth asking why. In this case, it was simple inertia. The analysts at the other lab and I both had scripts written for a given data structure. In my case, the scripts had been handed down from PI to PI for years, until the original reasons behind certain data design decisions had faded from memory. I like to call this the sacred text effect. It usually shows up with code or scripts, but it can occur with any practice. The conversation usually goes like this:

You: Why is this data organized this way? 

Them: Because that is how my PI organized data when I was in graduate school, and besides, all of our analysis scripts are designed for this data structure.

You: Would you consider changing over to a more standardized data structure? There are several issues with the current structure that would be easily fixable, and if we use this standard, we can share the data more freely as well as use tools designed for this data structure. 

Them: Sure, I guess, but could you fix our current scripts to deal with the new structure?

Suddenly you’ve signed up for more work! It is vital that labs do not get locked into suboptimal data management practices simply due to inertia. If a practice doesn’t work, or causes delays, take the time to fix it. It will cost time now, but you will make that time back tenfold. A great example of this, and a major inspiration for this post, is the BIDS standard (Brain Imaging Data Structure), a standardized scheme for organizing neuroimaging data.
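
To give a sense of what a standard buys you, a heavily simplified BIDS-style layout looks something like the following (the real specification covers far more, so treat this as a sketch rather than a complete example):

/project/
    dataset_description.json
    sub-001/
        anat/
            sub-001_T1w.nii.gz
        func/
            sub-001_task-gng_run-1_bold.nii.gz

Because the subject, task, and run information appear in both the folder names and the file names, any BIDS-aware tool or script can find the right files without guessing at lab-specific conventions.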


These three cases illustrate the consequences of bad data management, but there are many more examples I could write about. To adapt a common idiom about families:

Every example of good data management is good in the same ways, but every example of bad data management is bad in its own unique way.

But again, it is important to point out that none of this is due to incompetence on the part of researchers. I’ve spoken and worked with many researchers who do not have the same technical background I do, and every one of them recognizes the issues inherent in bad data management practices (and can usually come up with a startling number of examples from their own work). It is simply that they were never trained in good data management, so they have had to figure everything out on their own, and they are very busy people. In the next post, I’ll lay out what I see as the 8 principles of good data management for researchers. These principles are based on my experience in the social and biomedical sciences, so they might not wholly apply to, for example, database management in a corporate setting.