Bioinformatics: Sequence Alignment Is Central…?

Keywords: Illumina, Sequence Alignment, algorithms, teaching, next-generation sequencing

I haven’t posted in a while; I have been busy teaching bioinformatics. I do receive an occasional email or question about learning bioinformatics, so why don’t I just write what I taught here?

Here, at least, was my thinking on the subject. Remember that I was teaching second-year students with a variety of backgrounds.

The first point is that sequence analysis/alignment is the heart of bioinformatics. Ok, you can argue with me on this. But I think that sequence alignment is, without question, a major – if not THE major – success in bioinformatics. Why do I say this?

1. Sequence alignment is non-trivial.

2. Sequence alignment approaches derive from a solid mathematical basis.

3. There are well-worked-out statistics for sequence alignment.

4. Sequence alignment is an extremely prevalent and popular application of bioinformatics – not least in evolutionary studies of gene change and, of course, in the analysis of the rapidly growing number of fully sequenced genomes (or even partially sequenced ones, for that matter).

5. New situations that are variants/subsets/offshoots of sequence alignment are emerging, and they have already produced new algorithmic/computational frameworks. So, although this is arguably a fairly mature area of study (I think so), there is new work being done. Specifically, I am thinking of new sequence alignment approaches for next-generation sequence data (esp. short reads like Illumina, ABI) and (probably) also for metagenomics data. In the case of next-generation sequencing, we mostly want to align near-perfect reads to a reference – doing this efficiently for tens of millions of reads is non-trivial. Some recent work that looks good is ZOOM! in Bioinformatics 2008 24:2431 and SeqMap in Bioinformatics 2008 24:2395. (But note that I have not yet used either.)
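To make the short-read problem in point 5 a bit more concrete, here is a minimal sketch of one common idea: index the reference by k-mers, then use exact "seed" matches to find candidate positions that are verified by counting mismatches. This is my own illustration, not the method used by ZOOM! or SeqMap; the function names, the toy reference and the choice of substitutions-only matching (no indels) are all assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, k, max_mismatches=1):
    """Return sorted (start, mismatches) hits with at most max_mismatches
    substitutions. Hypothetical helper: indels are not handled."""
    n_seeds = max_mismatches + 1
    if n_seeds * k > len(read):
        raise ValueError("read too short for this k and mismatch budget")
    hits = {}
    # Pigeonhole argument: if the read has at most max_mismatches
    # substitutions, at least one of max_mismatches + 1 non-overlapping
    # seeds must match the reference exactly.
    for s in range(n_seeds):
        offset = s * k
        seed = read[offset:offset + k]
        for pos in index.get(seed, ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue  # read would hang off the end of the reference
            if start in hits:
                continue  # candidate already verified via another seed
            d = hamming(read, reference[start:start + len(read)])
            if d <= max_mismatches:
                hits[start] = d
    return sorted(hits.items())


reference = "ACGTACGTTAGCCGATAGGCTA"  # toy reference, invented for the example
index = build_kmer_index(reference, 4)
print(map_read("TAGCCGTT", reference, index, 4))  # read with one substitution
```

The index is built once and shared across all reads, which is where the savings come from when there are tens of millions of them; real tools are far more careful about memory and seed design than this sketch.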

As a route to teaching bioinformatics, I also like sequence alignment because it touches on major topics in bioinformatics/biology: alignment itself, evolution of sequence (including phylogenetic tree construction), Markov models (PAM matrices for alignment, plus hidden Markov models such as profile HMMs and pair HMMs), etc. So just by examining sequence alignment, I end up introducing major “techniques” in bioinformatics (note that this point is certainly not original; you see it in the famous Durbin et al. book Biological Sequence Analysis and in other books like Mount’s text Bioinformatics).
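For readers who have not seen “alignment itself” before, here is a minimal sketch of global alignment scoring by dynamic programming, in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, linear gap penalty -2) and the function name are assumptions chosen for illustration; it returns only the optimal score, not the alignment, and it uses a flat match/mismatch score rather than a substitution matrix such as PAM.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning the first i-1 characters
    # of a against the first j characters of b; row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))  # three matches and one gap
```

Keeping only two rows is enough for the score; recovering the alignment itself would need the full table (or a traceback structure), which I have left out to keep the sketch short.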