Key Bioinformatics Computer Skills

I’ve been asked several times which computer skills are critical for bioinformatics. Note that I am addressing only the “computer skills” side of things here. This is my list for being a functional, comfortable bioinformatician.

  1. SQL and knowledge of databases. I always recommend that people start with MySQL, because it is cross-platform, very popular, and extremely well developed.
  2. Perl or Python. Preferably perl. It kills me to write this, because I like python so much more than perl, but from a “getting the most useful skills” perspective, I think you have to choose perl.
  3. basic Linux. Actually, being at a semi-sysadmin level is even better. I always tell people to go “cold turkey”: just install Linux on their computer and commit to using it exclusively for a while. (Thanks to OpenOffice etc., this should be mostly doable these days.) This will force a person to get comfortable. Learning to use a Mac from the command line is an ok second option, as is Solaris etc. Still, I’d have to say Linux would be preferred.
  4. basic bash shell scripting. There are still too many cases where this ends up being “just the thing to do”. And of course, this all applies to Mac.
  5. Some experience with Java or other “traditional languages”, or a real understanding of modern programming paradigms. This may seem lame or vague, but it is important to understand how traditional programming languages approach problems. At minimum, this ensures some exposure to concepts like object-oriented programming, functional programming, libraries, etc. I know that one can get all of this with python and, yes, even perl – but I fear that many bioinformatics people get away without knowing these things, to their detriment.
  6. R + Bioconductor. So many great packages in Bioconductor. Comfort with R can solve a lot of problems quickly. R is only growing; if I could buy stock in R, I would!
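To make the database item concrete: below is a minimal sketch of storing and querying gene records with SQL, using Python’s built-in sqlite3 module as a stand-in for MySQL. The table, columns, and data are invented for illustration; the SQL itself would look essentially the same against a MySQL server.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (name TEXT, chrom TEXT, length INTEGER)")
conn.executemany(
    "INSERT INTO genes VALUES (?, ?, ?)",
    [("geneA", "chr1", 1500), ("geneB", "chr2", 900), ("geneC", "chr1", 2100)],
)

# A typical small query: all genes on chr1, longest first.
rows = conn.execute(
    "SELECT name, length FROM genes WHERE chrom = 'chr1' ORDER BY length DESC"
).fetchall()
print(rows)  # [('geneC', 2100), ('geneA', 1500)]
```

The same pattern – load parsed records into a table, then answer questions with SELECTs rather than ad hoc loops – is much of what day-to-day database use in bioinformatics looks like.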

This may seem like a lot, but many of these items fit together very well. For example, one could go “cold turkey” and just use Linux and commit to doing bioinformatics by using a combination of R, perl and shell scripting, and an SQL-based database (MySQL). It is very common in bioinformatics to link these pieces, so… not so bad, in the end, I think.

As always, comments welcome…

Free, easy, quick, great PDF creation: Try OpenOffice

keywords: free software, opensource, OpenOffice, grantwriting

I try to give credit where credit is due.

I have written before about using OpenOffice (version 2.4) for “real professional work.” In an earlier post, I wrote about successfully writing an entire grant application using OpenOffice for word processing and figure creation, in conjunction with Zotero for references (and the grant was funded, so…).

PDF creation from OpenOffice (use “Export to PDF” in the File menu) simply works great. It is very fast and the PDF quality is excellent. One note – it does not open the PDF automatically – it just stores the file – so pay attention to this. This works much better than printing to a PDF using the Adobe PDF printer, or than the Microsoft Office 2007 export-to-PDF function (which, besides being slow, caused Microsoft Office to crash occasionally on my machine).

Also, before I forget, I really like OpenOffice Draw for scientific figure creation – I use it a lot in my work and I have been quite happy with it. I’m using Microsoft Office a fair amount now, but I still use Draw to make figures. I’ve used Zotero and Draw for well over a year now, with fairly intense use.

Note: This is almost entirely based on using OpenOffice 2.4. The current version is 3.0, which I just downloaded.

Jobs: Graduate Student Funded Position available

I just want to write a short note that a FUNDED position is still available for a masters or Ph.D. student who would be a joint student with my lab and that of Gordon Chua here at University of Calgary. This is the same position posted earlier, so click the Jobs category and look at the previous posting…

I do think this would be an exciting and challenging position (but of course I do).

Bioinformatics: Sequence Alignment Is Central…?

Keywords: Illumina, Sequence Alignment, algorithms, teaching, next-generation sequencing

I haven’t posted in a while; I have been busy teaching bioinformatics. I do receive an occasional email or question about learning bioinformatics, so why don’t I just write what I taught here?

Here, at least, was my thinking on the subject. Remember that I was teaching second year students with a variety of backgrounds.

The first point is that sequence analysis/alignment is the heart of bioinformatics. Ok, you can argue with me on this. But I think that sequence alignment is, without question, a major – if not THE major – success in bioinformatics. Why do I say this?

1. Sequence alignment is non-trivial.

2. Sequence alignment approaches derive from a solid mathematical basis.

3. There are well worked out statistics for sequence alignment.

4. Sequence alignment is an extremely prevalent and popular application of bioinformatics – not least in evolutionary studies of gene change and, of course, in analysis of the rapidly growing number of fully sequenced genomes (or even partially sequenced ones, for that matter).

5. New situations that are variants/subsets/offshoots of sequence alignment are emerging that have already produced new algorithmic/computational frameworks. So, although this is arguably a fairly mature area of study (I think so), there is new work being done. Specifically, I am thinking of new sequence alignment approaches for next-generation sequence data (esp. short reads like Illumina, ABI) and (probably) also for metagenomics data. In the case of next-generation sequencing, mostly we want to align near-perfect reads – optimizing this for tens of millions of reads is non-trivial. Some recent work that looks good is ZOOM! in Bioinformatics 2008 24:2431 and SeqMap in Bioinformatics 2008 24:2395. (But note that I have not used either at all yet).

As a route to teaching bioinformatics, I also like sequence alignment because it touches on major topics in bioinformatics/biology: alignment itself, evolution of sequences (including phylogenetic tree construction), hidden Markov models (profile HMMs, pair HMMs), substitution matrices like PAM for alignment scoring, etc. So just by examining sequence alignment, I end up introducing major “techniques” in bioinformatics (note that this point is certainly not original; you see it in the famous Durbin et al. book Biological Sequence Analysis and in other books like Mount’s text Bioinformatics).
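To give the “solid mathematical basis” point some substance for readers new to the area, here is a minimal, score-only sketch of the classic dynamic-programming approach (Needleman-Wunsch global alignment). The scoring values are arbitrary illustrative choices, not taken from the course or any particular text.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch), O(len(a) * len(b)) time.

    Keeps only the previous DP row, so memory is O(len(b)).
    """
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: prefix of b vs all gaps
    for i, ca in enumerate(a, 1):
        curr = [i * gap]  # column 0: prefix of a vs all gaps
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)  # (mis)match
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))  # or a gap
        prev = curr
    return prev[-1]

print(nw_score("GATTACA", "GCATGCU"))  # 0 with this scoring scheme
```

Recovering the alignment itself requires keeping the full matrix and tracing back, but the recurrence above is the mathematical core, and variants of it underlie local alignment (Smith-Waterman) and affine-gap scoring as well.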

TAMALg: is the package available?

I’ve received a lot of questions recently about TAMALg availability. Unfortunately, there is only a difficult-to-install package available right now; I sent it to someone recently and they had a terrible time getting it going.

I do describe the algorithm in the supplementary materials to the ENCODE spike-in competition paper (Johnson et al, Genome Research 2008).

I would love to have a simple package to distribute, but this is little supported in today’s granting environment; in fact, I don’t think that making algorithms widely available has ever been well-supported by any US funding agency. And I doubt the situation is different here in Canada.

I may be getting another undergrad soon and would task that person with working on the package. As a new faculty member, I am simply overwhelmed with basics like getting my lab going right now.

I do hope that this situation changes and thanks to all for patience.

As I have noted previously, the L2L3combo predictions produced by the TAMALPAIS server (see previous posts on this or just search for “TAMALPAIS Bieda” – no quotes, though) are the same predictions as made by TAMALg. TAMALg also adds a step of estimating enrichment using maxfour-type methodology.

So you can get good TAMALg predictions of sites just by using the webserver. I suggest going this route.

And to repeat – TAMALg is almost certainly NOT what you want for promoter arrays. The exceptions: a factor present in only a tiny fraction of promoters, or one of the newer designs with very long promoter regions (e.g. for 10 kb promoters, it might be ok).

Commentary: ChIP-chip vs ChIP-seq and $$

Ok, so with the rush to ChIP-seq and all the hype (much of it deserved) around “next-generation” sequencing generally, you might think that arrays are dead as used for ChIP (i.e. ChIP-chip).

I don’t think this is going to happen, for simple cost reasons. For the near future, there will be lots of genome-scale ChIP studies and, for these, I strongly support ChIP-seq. It is a lot cheaper for better data. But I see a strong trend toward ChIP studies targeted toward specific biological questions, and often questions requiring large sample numbers (e.g. epigenetic changes in cancer).

The financial math really isn’t that hard; with ChIP-seq running ~$5000 for external users and ChIP-chip running at $660 for external users (NimbleGen single arrays), it seems pretty clear that if a fair number of samples are involved, ChIP-chip is the way to go. That is, unless high-res whole genome coverage is absolutely necessary (usually not).

Furthermore, for taking chances on experiments, $660/sample is a lot more appealing on a lab budget than $5000/sample, particularly when you consider that, in the real world, even poor testing of a speculative idea is going to take 2 or 3 samples at minimum (=~$15,000 for ChIP-seq vs $1980 for ChIP-chip). A lot of labs can blow $2000; blowing $15,000 really hurts.
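A quick sanity check on the arithmetic, using the per-sample external-user rates quoted above:

```python
# Per-sample external-user rates quoted in this post.
chip_seq_cost = 5000   # ChIP-seq, $ per sample
chip_chip_cost = 660   # ChIP-chip (NimbleGen single array), $ per sample

# Total cost of a speculative experiment at n samples.
for n in (1, 3, 10):
    print(f"{n:>2} samples: ChIP-seq ${n * chip_seq_cost:>6,} "
          f"vs ChIP-chip ${n * chip_chip_cost:>6,}")
```

At three samples the gap is $15,000 vs $1,980 – roughly a factor of 7.5 per sample at these rates, and it only compounds as sample numbers grow.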

Given this analysis, it seems to me that NimbleGen should really push the low end of the market – in other words, try to get the cost even lower on a per sample basis (for fewer spots). I think they are on the right track with their multiplex arrays, but development of these has been disappointingly slow, and last time I looked, the cost structure around the 4plex with 70K/quadrant really wasn’t very attractive.

I may revisit this topic another time, but that is it for now.

Python and Bioinformatics and Perl: Chomp in python

Update: As many readers have commented, I have just missed the obvious – there are functions in python to do this. See comments section for details.

So I do a lot of file processing in my bioinformatics work and I’ve always really liked the perl function chomp.

I wanted to implement something in python to do this – something that, like the perl version, can handle multiple line endings (that is, Linux, Windows, and Mac line endings).

So this is chomp in python – in essence a def chomp, but renamed.

IMPORTANT: I am not guaranteeing in any way that this completely replicates chomp behavior. And, of course, this won’t work on more unusual systems that have different line ending conventions. In my work, I use UNIX/Linux, windows, and older mac stuff – so this works for those. And it handles ugly cases well, as you can see.

Enjoy! and comments welcome.

Also, this is not beautiful code! I threw this together because I was frustrated.

NOTE: you will have to adjust the tab spacing for the function to work; but you already know this… copying to HTML can be a pain…

def chomppy(k):
    if k == "": return ""
    if k == "\n" or k == "\r\n" or k == "\r": return ""
    if len(k) == 1: return k  # depends on the case above being false
    if len(k) == 2 and (k[-1] == '\n' or k[-1] == '\r'): return k[0]
    # done with weird cases, now deal with the average case
    lastend = k[-2:]  # get the last two characters
    if lastend == '\r\n':
        return k[:-2]
    elif lastend[1] == "\n" or lastend[1] == "\r":
        return k[:-1]
    return k

>>> chomppy('cow\n')
'cow'
>>> chomppy('')
''
>>> chomppy('hat')
'hat'
>>> chomppy('cat\r\n')
'cat'
>>> chomppy('\n')
''
>>> chomppy('\r\n')
''
>>> chomppy('cat\r')
'cat'
>>> chomppy('\r')
''

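For comparison, the built-in route the commenters point to is presumably str.rstrip with an explicit character set, which handles all three line-ending conventions in one call – with one behavioral difference from perl’s chomp worth noting.

```python
# str.rstrip("\r\n") removes any trailing mix of CR and LF characters,
# so one call covers Linux, Windows, and old-Mac line endings.
for line in ("cat\n", "cat\r\n", "cat\r", "cat", ""):
    print(repr(line.rstrip("\r\n")))

# Caveat: unlike perl's chomp, rstrip strips *all* trailing newline
# characters, not just one line ending:
print(repr("cat\n\n".rstrip("\r\n")))  # 'cat', where chomp would leave 'cat\n'
```

When that difference matters (preserving interior blank lines at the end of a record), a hand-rolled function like the one above is still the way to go.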