Docker for Bioinformatics: An enormous set of images (3007! at last count)

keywords: docker, bioinformatics, software

I’m a huge fan of docker (like most everyone, it seems). My lab has been working on some docker images and pipelines including our custom code (not released yet). I’ve been using a lot of docker images to do quick analysis – I’ll write more on this in another post.

I just ran across an enormous set of “dockerized” bioinformatics software – look at

As of today, there are 3007 images and they seem to encompass a lot of popular packages – like samtools.

I really like the documentation of the images and dockerfiles on this site – very easy to see what is actually in the image.

One issue: some packages are frequently updated – and the updates are important but the images are a bit behind. So be careful with version issues. The bcftools image is at least one version behind, for example.

As always, comments welcome.


2009 post: Key Bioinformatics Computer Skills

Note: this was written in 2009 so… out of date somewhat!

I’ve been asked several times about which computer skills are critical for bioinformatics. Important – note that I am just addressing the “computer skills” side of things here. This is my list for being a functional, comfortable bioinformatician.

  1. SQL and knowledge of databases. I always recommend that people start with MySQL, because it is crossplatform, very popular, and extremely well developed.
  2. Perl or Python. (2017 update: Python wins now!) The original 2009 advice was: preferably perl. It kills me to write this, because I like python so much more than perl, but from a “getting the most useful skills” perspective, I think you have to choose perl.
  3. basic Linux. Actually, being at a semi-sys admin level is even better. I always tell people to go “cold turkey” and just install Linux on their computer and commit to using it exclusively for a while. (Due to OpenOffice etc, this should be mostly doable these days). This will force a person to get comfortable. Learning to use a Mac from the command line is an ok second option, as is Solaris etc. Still, I’d have to say Linux would be preferred.
  4. basic bash shell scripting. There are still too many cases where this ends up being “just the thing to do”. And of course, this all applies to Mac.
  5. Some experience with Java or other “traditional languages” or a real understanding of modern programming paradigms. This may seem lame or vague. But it is important to understand how traditional programming languages approach problems. At minimum, this ensures some exposure to concepts like object-oriented programming, functional programming, libraries, etc. I know that one can get all of this with python and, yes, even perl – but I fear that so many bioinformatics people get away without knowing these things to their detriment.
  6. R + Bioconductor. So many great packages in Bioconductor. Comfort with R can solve a lot of problems quickly. R is only growing; if I could buy stock in R, I would!

This may seem like a lot, but many of these items fit together very well. For example, one could go “cold turkey” and just use Linux and commit to doing bioinformatics by using a combination of R, perl and shell scripting, and an SQL-based database (MySQL). It is very common in bioinformatics to link these pieces, so… not so bad, in the end, I think.

As always, comments welcome…

2008 post: Bioinformatics: Sequence Alignment Is Central…?

Keywords: Illumina, Sequence Alignment, algorithms, teaching, next-generation sequencing

I haven’t posted in a while; I have been busy teaching bioinformatics. I do receive an occasional email or question about learning bioinformatics, so why don’t I just write what I taught here?

Here, at least, was my thinking on the subject. Remember that I was teaching second year students with a variety of backgrounds.

The first point is that sequence analysis/alignment is the heart of bioinformatics. Ok, you can argue with me on this. But I think that sequence alignment is, without question, a major – if not THE major – success in bioinformatics. Why do I say this?

1. Sequence alignment is non-trivial.

2. Sequence alignment approaches derive from a solid mathematical basis.

3. There are well worked out statistics for sequence alignment.

4. Sequence alignment is extremely prevalent and popular as an application of bioinformatics – not least in evolutionary studies of gene change and, of course, in the analysis of the rapidly growing number of fully sequenced genomes (or even partially sequenced ones, for that matter).

5. New situations that are variants/subsets/offshoots of sequence alignment are emerging that have already produced new algorithmic/computational frameworks. So, although this is arguably a fairly mature area of study (I think so), there is new work being done. Specifically, I am thinking of new sequence alignment approaches for next-generation sequence data (esp. short reads like Illumina, ABI) and (probably) also for metagenomics data. In the case of next-generation sequencing, mostly we want to align near-perfect reads – optimizing this for tens of millions of reads is non-trivial. Some recent work that looks good is ZOOM! in Bioinformatics 2008 24:2431 and SeqMap in Bioinformatics 2008 24:2395. (But note that I have not used either at all yet).
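
To make the short-read point concrete, here is a minimal sketch of one common idea for mapping near-perfect reads: index the reference by k-mers, look up exact "seed" matches, then verify each candidate position by counting mismatches. This is not the algorithm used by ZOOM! or SeqMap; the function names, the toy reference, and the parameters `k` and `max_mismatches` are all assumptions made up for illustration, and gaps (indels) are not handled.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def map_read(read, reference, index, k, max_mismatches=1):
    """Return sorted start positions where the read matches the reference
    with at most max_mismatches substitutions (no gaps).

    The read is cut into non-overlapping seeds of length k. If the read
    holds at least max_mismatches + 1 seeds, at least one seed must be
    mismatch-free (pigeonhole), so every valid position is found.
    """
    hits = set()
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue  # the read would hang off the end of the reference
            window = reference[start:start + len(read)]
            if hamming(read, window) <= max_mismatches:
                hits.add(start)
    return sorted(hits)


# Toy data (hypothetical): a 12-base read taken from position 8 of the
# reference, with one base changed to mimic a sequencing error.
reference = "ACGTACGTTAGCCGATAGGCTAACGT"
read = "TATCCGATAGGC"
index = build_kmer_index(reference, k=4)
print(map_read(read, reference, index, k=4, max_mismatches=1))
```

The index is built once and reused for every read, which is why this kind of approach scales much better than scanning the reference for each of tens of millions of reads. Real tools use far more compact indexes and smarter seed designs.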

As a route to teaching bioinformatics, I also like sequence alignment because it touches on major topics in bioinformatics/biology: alignment itself, evolution of sequence (including phylogenetic tree construction), hidden Markov models (profile HMMs, pair HMMs, PAM for alignment), etc. So just by examining sequence alignment, I end up introducing major “techniques” in bioinformatics (note that this point is certainly not original; you see it in the famous Durbin et al. book Biological Sequence Analysis and in other books like Mount’s text Bioinformatics).

2008 post: TAMALg: is the package available?

I’ve received a lot of questions recently about TAMALg availability. Unfortunately, there is only a difficult-to-install package available right now; I sent it to someone recently and they had a terrible time getting it going.

I do describe the algorithm in the supplementary materials to the ENCODE spike-in competition paper (Johnson et al, Genome Research 2008).

I would love to have a simple package to distribute, but this is little supported in today’s granting environment; in fact, I don’t think that making algorithms widely available has ever been well-supported by any US funding agency. And I doubt the situation is different here in Canada.

I may be getting another undergrad soon and would task that person with working on the package. As a new faculty member, I am simply overwhelmed with basics like getting my lab going right now.

I do hope that this situation changes and thanks to all for patience.

As I have noted previously, the L2L3combo predictions produced by the TAMALPAIS server (see previous posts on this or just search for “TAMALPAIS Bieda” – no quotes, though) are the same predictions as made by TAMALg. TAMALg also adds the step of estimating enrichment via using maxfour type methodology.

So you can get good TAMALg predictions of sites just by using the webserver. I suggest going this route.

And to repeat – TAMALg is almost certainly NOT what you want for promoter arrays. Except if you have a factor in only a tiny fraction of promoters or one of the newer designs with very long promoter regions (e.g. for 10 kb promoters, might be ok).

2008 post: Python for Perl Programmers (and Bioinformatics people)

Uh, this is so old that it should be skipped, I think… I’m keeping it up for archival purposes.


Keywords: Mark Bieda, python, getting started, quick tips, hints, tutorial

I wanted to write a short post about getting started in python.

What you will like about Python as a perl person:
(1) A great thing is the interpreter. This will allow really rapid learning of python. For a perl person, python should come really fast. I was very, very surprised at how quickly I was writing actually useful (not toy) programs to manipulate things.
(2) It is easy to install in windows and has a decent editor/run environment (IDLE). Python is now a standard part of Linux distros, except for the smallest ones (perl is everywhere, so an advantage to perl here, but only a small one).

Some key things:
(1) The online manuals for python are good (but maybe not great). The Guido tutorial is key; make sure that you get the latest one.
(2) If you like to have a book on python around (I always do for my programming language du jour), then make sure that you have the most recent one.
(3) Why the emphasis on the most recent? Python has added key new features in recent times – like even since version 2.4! So make sure that you have the latest documentation.

Installation and Usage:
(1) For windows people, use the IDLE editor. Really. You will find it very easy to use and efficient. It comes in the download, so no installation deal.
(2) To learn python really fast, just play with commands in the interpreter window. It really is easy and efficient – a very quick way to get up to speed on things.

Some key things for bioinformatics people, in particular:
(1) Sets. Sets are very nice. Intersection, union… all that stuff that you want to use.
(2) A lot of string manipulation functions (actually methods, technically) are available. These will do a lot of what you would do with regular expressions, but see the next point.
(3) Unfortunately, regular expressions are in an external (but standard) library and are a bit different from perl in usage/implementation.
(4) Like perl, the built-in sorting in python is weird (and annoying to set up to do anything beyond simple), but very useful. Again, here, make sure that you look at the latest documentation.
(5) Sqlite library is now part of the standard package. I haven’t used it yet as part of python – but given that this is a standard part of the distribution, it seems like I could write code that uses it and not worry about portability issues. This is well worth looking at for bioinformatics people.
(6) Remember that tuples are unchangeable (immutable) and lists are changeable. So far, this has led me to be pretty list-oriented, but I am new to this.
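
The points in the list above can be sketched in a few lines. This is written for current Python 3 rather than the 2.x versions this post was about, and the gene names, header string, and peak values are invented purely for illustration.

```python
import re

# (1) Sets: genes bound in two hypothetical experiments.
chip_a = {"MYC", "TP53", "E2F1"}
chip_b = {"TP53", "E2F1", "GATA1"}
shared = chip_a & chip_b   # intersection
either = chip_a | chip_b   # union
only_a = chip_a - chip_b   # difference

# (2) String methods often replace a simple regular expression.
header = ">chr1:100-200 sample=liver"
name = header.lstrip(">").split()[0]

# (3) Regular expressions live in the standard-library module "re".
match = re.match(r"(\w+):(\d+)-(\d+)", name)
chrom = match.group(1)
start = int(match.group(2))
end = int(match.group(3))

# (4) Sorting beyond the simple case uses a key function.
# Each peak is (chromosome, position, score).
peaks = [("chr2", 50, 7.1), ("chr1", 300, 2.5), ("chr1", 100, 9.0)]
by_position = sorted(peaks, key=lambda p: (p[0], p[1]))
by_score_desc = sorted(peaks, key=lambda p: p[2], reverse=True)

# (6) Tuples are immutable; lists are not.
interval = ("chr1", 100, 200)
try:
    interval[1] = 150
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True
coords = list(interval)  # copy into a list when you need to edit
coords[1] = 150

print(shared, name, chrom, start, end)
print(by_position)
print(by_score_desc)
print(coords)
```

Typing these lines one at a time into the interpreter window is a quick way to see what each piece does.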

I’ll leave it at that for now. I’ll write more about python later on.

2008 post: I wish I had… started with python earlier…

So far, my bioinformatics work has used a melange of perl, R, and bash scripting. While this has worked pretty well, it does have limits. For one, it is not very portable (bash scripting). I’ve already had problems with distributing software.

I wanted something that I could distribute in an easier way, yet had the advantages of perl. I found Jython, which is Python-in-Java. For me, the big deal is not use of Java libraries, but rather that the language would compile to Java byte-code and hence would be easy to distribute.

But I found that Python is much more than this: the interactive environment, for one, makes me ok with not having my unix/linux toolbox when I am stuck on the windows side.

And Python has a lot of nice features for bioinformatics work, including convenient types like sets (as of version 2.4) and even comes with sqlite (which I have not used from python, but want to)…
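
For anyone curious what the bundled sqlite support looks like, here is a minimal sketch using the standard-library `sqlite3` module with an in-memory database. The table name, columns, and rows are hypothetical examples, not data from any real experiment.

```python
import sqlite3

# An in-memory database: nothing is written to disk, no server is needed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE peaks (chrom TEXT, start INTEGER, end_pos INTEGER, score REAL)"
)

# Hypothetical binding-site predictions.
rows = [
    ("chr1", 100, 200, 9.0),
    ("chr1", 300, 400, 2.5),
    ("chr2", 50, 150, 7.1),
]
# "?" placeholders let sqlite3 handle quoting of the values.
conn.executemany("INSERT INTO peaks VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Count peaks and find the best score per chromosome.
cursor = conn.execute(
    "SELECT chrom, COUNT(*), MAX(score) FROM peaks GROUP BY chrom ORDER BY chrom"
)
summary = cursor.fetchall()
conn.close()

for chrom, n_peaks, best in summary:
    print(chrom, n_peaks, best)
```

Replacing `":memory:"` with a file name gives a single-file database that travels with the script, which is the portability advantage over a server-based setup.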

Anyways, for now, I am a fan.

2008 post: TAMALPAIS and promoter arrays

Keywords: TAMALPAIS, NimbleGen, Promoter Arrays, Array Analysis, Problems, Mark Bieda

I’ve been receiving some questions on TAMALPAIS usage for promoter arrays via email.

On the TAMALPAIS website, I say “Do not use this for promoter arrays.”

This is actually not quite true; there are a limited number of cases in which TAMALPAIS will perform well for promoter arrays. In this post, I discuss this.

When TAMALPAIS is ok for promoter arrays:
In short:
1. If your factor only binds to a tiny portion of the promoters (<5%), then TAMALPAIS will perform ok.
2. More correctly, and more importantly: if only a small number of probes on the array are within binding sites for your factor, then you are ok. So for promoter array designs with long promoters, you might have 15% of the promoters with a binding site, but only a small number of probes in the binding sites. (Hopefully this makes sense.)

Why do I say “Do not use TAMALPAIS for promoter arrays”?
If you have a factor that binds to (or exists in) a lot of promoter regions – like POLII or some histone modifications – then TAMALPAIS will give you bad results. I don’t want that to happen. Right now, studies of histone mods and POLII are a big deal, so I don’t want people to be unhappy.

If not TAMALPAIS, then what?
There are a number of options. I developed maxfour to score promoters (see Krig et al. 2007 in JBC). I will be releasing an easy-to-use version of this software by fall 2008 (planned, not a promise). This is really the best option with NimbleGen’s current crop of designs, in my opinion. Someone else may have some great promoter array analysis software; I’m not aware of any right now – feel free to email me or leave comments. I don’t mean to be unfair to other bioinformaticians with this.

What about the promoter array analysis server?
Ah, yes. This does very limited analysis – see my post on it in this blog (click the promoter array category button on the sidepanel).