Today I want to drill down a bit into last week's post (found here) about software design and talk about the first of three very general categories of software one might develop in the social sciences. These categories are a) my own taxonomy and b) only a very general one, and of course many pieces of software fall into a mixture of them. All that to say, I have found these distinctions helpful in understanding how to structure software:
Applications - These pieces of software are designed to perform a primary task or set of tasks, while minimizing the amount of secondary knowledge (e.g. programming, data management) required of their users. This comes at the cost of being relatively inflexible.
Libraries - These pieces of software extend the capabilities of an existing programming language in some way. They require high secondary and primary knowledge of the user. This allows libraries to be very flexible in their use.
Modules - A middle ground between applications and libraries, this type of software simplifies a primary task, reduces secondary knowledge cost, and allows for a great deal of flexibility. Often, this type of software is made to work with several other modules as well.
With those brief descriptions, I want to start by discussing the general design of applications.
Applications minimize secondary knowledge cost.
The category that I refer to as “Applications” covers any piece of software that aims to a) perform a complete task and b) minimize what additional knowledge users need to bring along. This is best illustrated with some examples of what I do and don't consider to be applications.
Applications:
SPSS is an obvious choice for the category of application. It handles all aspects of running statistics, and it abstracts away from the language it was written in, a combination of Java and likely C.
The R package lavaan I also consider an application. It aims to handle all aspects of running SEM models, and it abstracts away from R considerably. Besides data input and some very basic function calls, most of the work in using lavaan is setting up the model syntax.
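For instance, here is a minimal sketch of a confirmatory factor analysis in lavaan, using its standard cfa() interface and the HolzingerSwineford1939 example data that ships with the package. Nearly all of the user's effort goes into the model syntax, not into R itself:

```r
library(lavaan)

# The "work" is the model syntax, which is statistical rather than programmatic
model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'

# Data input and a single function call are the only R the user really needs
fit <- cfa(model, data = HolzingerSwineford1939)
summary(fit, fit.measures = TRUE)
```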
Not Applications:
I wouldn’t consider the R package ggplot2 to be an application. It performs a specific task, yes, but it doesn’t abstract away from R sufficiently; instead, I would consider it a module (see the short sketch after these examples).
The C++ library Armadillo (link) is definitely not an application; I would consider it a library. It simply aims to extend the linear algebra capabilities of C++.
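To make the ggplot2 contrast concrete, here is a minimal sketch using R's built-in mtcars data. Even a simple plot assumes comfort with data frames, function arguments, and ggplot2's layering syntax, which is exactly the kind of secondary (R) knowledge an application would try to hide:

```r
library(ggplot2)

# The package simplifies plotting, but the user is still writing R:
# they need to understand data frames, aesthetic mappings, and layers
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```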
Designing the user interface for an application requires a great deal of careful consideration of who your user base is going to be, as you can have very few expectations about the technical knowledge of any given user. For example, SPSS is successful because it makes the act of running fairly complex statistical models a matter of navigating a set of graphical user interfaces (GUIs). This of course requires knowledge of the statistical models (at least in theory, if not in practice), but it doesn’t require any programming expertise. The only secondary knowledge it really requires is the ability to navigate GUIs.
Contrast this with base R’s statistical capabilities. I can easily run a regression in R in a single line of code that might take me several minutes of clicking through GUIs in SPSS. This, however, requires more knowledge: not only do I need to know how to set up a regression, I need to understand R formulas, data input, and how to assign results to objects.
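As a rough sketch (using R's built-in lm() and the mtcars example data), that single line looks something like the following, and every piece of it is secondary knowledge the user has to bring along:

```r
# One line replaces several minutes of SPSS menus, but it assumes the user
# understands formula syntax, data frames, and assignment to objects
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)
```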
This “secondary knowledge cost” is what you are trying to minimize when you are writing an application. You can expect the user to know what the program does (e.g. SPSS does statistics), and you are trying to minimize everything else the user needs to know (e.g. SPSS does not require object-oriented programming).
Let me elaborate on this idea of secondary knowledge cost with a more personal example. I develop and maintain a Python package called clpipe (link). This “package” is really a set of command line functions for quickly processing neuroimaging data on high performance clusters (HPCs). For those of you who aren’t neuroimagers, neuroimaging data requires extensive processing before analysis, and this processing is quite mathematically complex. People spend entire academic careers on processing, and many software programs have been developed to perform it. There were several issues that I felt warranted an additional piece of software:
Getting neuroimaging data from scanner to analysis requires the use of several programs at a minimum, which in turn requires knowledge of how to use each of those programs (non-trivial; neuroimaging software is not typically designed well).
Quite a bit of time is spent on data management when you are working with neuroimaging data. Ideally, this can be done using some sort of scripting language, but that requires knowledge of the scripting language.
Processing neuroimaging data takes quite a bit of time. Processing subjects in parallel on high performance clusters makes this much quicker, but that requires knowledge of how to use an HPC.
So, in sum, to process neuroimaging data you not only need to know about the actual processing, you also have to understand the idiosyncrasies of several neuroimaging programs, know how to do data management, and ideally understand how to use a high performance cluster.
My program, clpipe, attempts to lessen this secondary knowledge cost by automating many of those steps. I have written very little code that actually processes the data; that is covered by a variety of programs that clpipe interfaces with (FMRIPREP, dcm2bids). Instead, clpipe manages data and the submission of jobs to HPCs. The only secondary knowledge it requires is a working knowledge of navigating Linux filesystems (not unreasonable in neuroimaging) and a very basic understanding of how to format a couple of JSON files (configuration of the pipelines is done via JSON files). Of course, I made no attempt at lessening the primary knowledge cost: to use clpipe, you do need to know how to process neuroimaging data and the myriad choices you can make along the way.
So, stepping back, what makes a good application? To me, a good application minimizes what additional things you need to know to do your primary task. The cost, however, is that a good application is not flexible. It makes what it does easy, but you are SOL if you want to do something outside of that specific task (try tricking SPSS into doing something outside of what it is explicitly designed to do). So how does this translate into design principles? Here are my thoughts:
Identify what the primary task of the application is. Imagine your user as somebody who knows everything about that task (e.g. they are an expert in regression) but has absolutely no knowledge of anything else (e.g. they have never programmed in their lives).
Given that theoretical user and the restrictions on your implementation, minimize what additional things the user needs to know. If you are writing an R package to do one specific type of analysis, you are going to be hard-pressed to make a GUI, but you can minimize what the user needs to know about R to use your package (again, lavaan is an excellent example of this; a minimal sketch of the idea follows this list).
Make sure not to violate the expected flow of a given task. An application is not providing the tools to do a task, it is doing the task for the user.
Be very wary of designing an application so that it is easiest for you to use. I see this quite a bit, and fall victim to it quite a bit as well. By definition, if you are developing an application, you have far more secondary knowledge than the target user.
In a related vein, don’t underestimate how little secondary knowledge a user may have.
Finally, if you are developing an application, fully commit to that minimization of secondary knowledge. If you half-ass it, the resulting application will be much worse than if you had decided to just develop a library or module, because you will be muddying users' expectations of what they need to know. Being honest with your users about those expectations always makes for a better piece of software.
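To illustrate the second principle above, here is a purely hypothetical sketch of an application-style R interface (the function name, arguments, and analysis are made up for illustration). The point is that a single high-level function hides data input, fitting, and output formatting, rather than handing the user a toolbox of pieces to assemble:

```r
# Hypothetical application-style interface: one function, minimal R required.
# The user supplies a file path and column names; everything else is hidden.
run_growth_model <- function(data_file, id, time, outcome) {
  dat <- read.csv(data_file)                 # data input handled for the user
  dat <- dat[, c(id, time, outcome)]         # keep only what the analysis needs
  names(dat) <- c("id", "time", "outcome")
  fit <- lm(outcome ~ time, data = dat)      # stand-in for the real analysis
  cat("Estimated change in outcome per unit time:",
      round(coef(fit)[["time"]], 3), "\n")   # plain-language output, not a bare R object
  invisible(fit)
}

# Hypothetical usage:
# run_growth_model("study1.csv", id = "subject", time = "wave", outcome = "score")

# A library-style design would instead expose lm() itself and expect the user
# to manage the data, fit the model, and interpret the resulting object.
```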
Designing an application as I defined it previously is quite a difficult task. When I started working on clpipe, I was astonished at how difficult it was to get to a point where users felt comfortable using it (they still don’t, but that is neither here nor there). This category is really the most design-intensive of the three, because it is all about putting yourself in the place of a user who, by definition, doesn’t have the same level of knowledge you have. Think carefully, draft out your UX before you ever write a line of code, and have a number of beta testers!
Next week I will give some thoughts on how to think about developing libraries. These pieces of software are the opposite of an application, as they attempt to minimize primary knowledge cost at the price of requiring high secondary knowledge.
Cheers!
Teague