Welcome to the web home of the Stoltzfus research group at The Institute Formerly Known as CARB.

Best practices for scientific programmers - top ten

Today I'm teaching a session on "Best Practices" in a "Programming for Biologists" course.  My course materials are online (feedback from other instructors is welcome).  I'll start out with my "top ten" list:

  • Interface, interface, interface
  • Modularize
  • Write code to be understood
  • Write tests and trap errors
  • Stamp your output
  • Use revision control
  • Make use of prior art
  • Create an installable package
  • Make your project open source
  • Set up a project management infrastructure

The first 3 are universally important, not just for scientists.  For scientists without formal training, I would stress the practice of designing interfaces before writing the "guts" of the code.  To drive this home, we have an exercise to write the skeleton of a script that has only 1 line of bioinformatics (going to NCBI to get something), while all the rest is interface: command-line options, help message, internal documentation, and output translation.
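To give a flavor of that exercise, here is a minimal sketch in Python. It is not the actual course exercise. The option names (--id, --db, --out) and the function names are made up for illustration, and I am assuming the NCBI E-utilities efetch service as the "something" to fetch. Note how little of the script is bioinformatics.

```python
"""Fetch one sequence record from NCBI and write it out as FASTA.

Illustrative skeleton: nearly everything here is interface.
"""
import argparse
import sys
import urllib.parse
import urllib.request

# Assumed endpoint: the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_parser():
    """Define the command-line interface: options, defaults, help text."""
    parser = argparse.ArgumentParser(
        description="Fetch one sequence record from NCBI in FASTA format.")
    parser.add_argument("--id", required=True, dest="record_id",
                        help="accession or identifier of the record")
    parser.add_argument("--db", default="nucleotide",
                        help="NCBI database to query (default: %(default)s)")
    parser.add_argument("--out", default="-",
                        help="output file, or '-' for standard output")
    return parser


def fetch_fasta(record_id, db):
    """The one line of bioinformatics: ask NCBI for the record."""
    query = urllib.parse.urlencode(
        {"db": db, "id": record_id, "rettype": "fasta", "retmode": "text"})
    with urllib.request.urlopen(f"{EFETCH_URL}?{query}") as response:
        return response.read().decode("utf-8")


def main(argv=None):
    """Parse options, fetch the record, and write it where requested."""
    args = build_parser().parse_args(argv)
    text = fetch_fasta(args.record_id, args.db)
    if args.out == "-":
        sys.stdout.write(text)
    else:
        with open(args.out, "w") as handle:
            handle.write(text)
```

A real script would finish by calling main() when run from the command line. The point of the sketch is the proportions: the parser, the help strings, and the output handling are all settled before the "guts" get any attention.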

To prioritize the others, we have to take into account the typical conditions of scientific programming.  In my experience, the typical scientific software product:  

  • is supposed to do one thing accurately and reproducibly 
  • has one or a few users 
  • has a short (project-specific) lifespan 

This kind of software typically is not-- and does not need to be-- robust (to various inputs), optimized for performance, or well documented. For the scientific programmer developing scripts for a project, the focus should be on accuracy and provenance, so that when you are getting ready to write up your results for publication, you know exactly what was done, and when it was done. Revision control is part of keeping an accurate record-- it lets you know which version of a script was current at each point in time.  Stamping all of your output files with name, date, and version is also critical to record-keeping.
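Stamping can be as simple as writing one comment line at the top of every output file. The sketch below is one way to do it in Python. The names make_stamp and write_stamped, the version string, and the "#" comment convention are all my assumptions for illustration, so adapt them to whatever your output format allows.

```python
import datetime
import os
import sys

__version__ = "0.1.0"  # hypothetical version string; update it with each change


def make_stamp(script_name, version, when=None):
    """Return a one-line comment recording the script name, version, and time."""
    when = when or datetime.datetime.now()
    return f"# generated by {script_name} version {version} on {when:%Y-%m-%d %H:%M:%S}"


def write_stamped(handle, lines, script_name=None, version=__version__, when=None):
    """Write the stamp first, then each data line."""
    # Default to the name of the running script if none is given.
    script_name = script_name or os.path.basename(sys.argv[0])
    handle.write(make_stamp(script_name, version, when) + "\n")
    for line in lines:
        handle.write(line + "\n")
```

If the script lives under revision control, the version string can be tied to a tagged revision, so that the stamp in an output file points back to the exact code that produced it.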

Even if you are just writing scripts for yourself, it's helpful to get in the habit of writing scripts as though your friends and colleagues were going to use them as well.  This means keeping the interfaces clean and providing internal documentation.  Even if you are the only one who uses your scripts, you will find this helpful-- every programmer learns within a few years that we quickly forget things like how a script was used, or why it was written in a particular way.

Most scientific programmers do not spend nearly enough time incorporating tests and traps into their code, both of which are critical for getting accurate and robust code. And of course, once you get to the stage of having a package of inter-dependent software parts, it is essential to have a test-suite that allows you to maintain functionality while adapting and improving the code.
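As a small illustration of both ideas, here is a sketch in Python. The function gc_content and its test are hypothetical examples, not taken from the course materials. The "traps" are the checks that refuse bad input instead of silently computing a wrong answer, and the test is a function that exercises both the normal cases and the traps.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    # Trap: an empty sequence would otherwise cause a division by zero.
    if not sequence:
        raise ValueError("empty sequence")
    seq = sequence.upper()
    # Trap: anything other than A, C, G, T is treated as an input error
    # (an assumption made for this sketch; real data may need ambiguity codes).
    unexpected = set(seq) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return (seq.count("G") + seq.count("C")) / len(seq)


def test_gc_content():
    """Check known answers, then check that the traps actually fire."""
    assert gc_content("GGCC") == 1.0
    assert gc_content("atgc") == 0.5
    assert gc_content("AATT") == 0.0
    for bad in ["", "ATGX"]:
        try:
            gc_content(bad)
        except ValueError:
            pass
        else:
            raise AssertionError(f"no error raised for {bad!r}")
```

Running test_gc_content() after every change is the single-function version of a test-suite: if an "improvement" breaks the known answers, you find out immediately rather than at publication time.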

If I had a second class session to use, I would focus the whole thing on writing tests and traps.