Key Bioinformatics Computer Skills

I’ve been asked several times about which computer skills are critical for bioinformatics. Note that I am just addressing the “computer skills” side of things here. This is my list for being a functional, comfortable bioinformatician.

  1. SQL and knowledge of databases. I always recommend that people start with MySQL, because it is cross-platform, very popular, and extremely well developed.
  2. Perl or Python. Preferably perl. It kills me to write this, because I like python so much more than perl, but from a “getting the most useful skills” perspective, I think you have to choose perl.
  3. basic Linux. Actually, being at a semi-sys admin level is even better. I always tell people to go “cold turkey” and just install Linux on their computer and commit to using it exclusively for a while. (Due to OpenOffice etc, this should be mostly doable these days). This will force a person to get comfortable. Learning to use a Mac from the command line is an ok second option, as is Solaris etc. Still, I’d have to say Linux would be preferred.
  4. basic bash shell scripting. There are still too many cases where this ends up being “just the thing to do”. And of course, this all applies to Mac.
  5. Some experience with Java or other “traditional languages” or a real understanding of modern programming paradigms. This may seem lame or vague. But it is important to understand how traditional programming languages approach problems. At minimum, this ensures some exposure to concepts like object-oriented programming, functional programming, libraries, etc. I know that one can get all of this with python and, yes, even perl – but I fear that many bioinformatics people get away without knowing these things to their detriment.
  6. R + Bioconductor. So many great packages in Bioconductor. Comfort with R can solve a lot of problems quickly. R is only growing; if I could buy stock in R, I would!

This may seem like a lot, but many of these items fit together very well. For example, one could go “cold turkey” and just use Linux and commit to doing bioinformatics by using a combination of R, perl and shell scripting, and an SQL-based database (MySQL). It is very common in bioinformatics to link these pieces, so… not so bad, in the end, I think.

As always, comments welcome…

Free, easy, quick, great PDF creation: Try OpenOffice

keywords: free software, opensource, OpenOffice, grantwriting

I try to give credit where credit is due.

I have written before about using OpenOffice (version 2.4) for “real professional work.” In an earlier post, I wrote about successfully writing an entire grant application using OpenOffice for word processing and figure creation in conjunction with Zotero for references (and the grant was funded, so…).

PDF creation from OpenOffice (use “Export to PDF” in the File menu) simply works great. It is very fast and the pdf quality is excellent. One note – it does not open the pdf automatically – it just stores the file – so pay attention to this. This works much better than printing to a pdf using the Adobe PDF printer or using the Microsoft Office 2007 export to pdf functions (which, besides being slow, caused Microsoft Office to crash occasionally on my machine).

Also, before I forget, I really like OpenOffice Draw for scientific figure creation – I use it a lot in my work and I have been quite happy with it. I’m using Microsoft Office a fair amount now, but I still use draw to make figures. I’ve used Zotero and Draw for well over a year now, with fairly intense use.

Note: This is almost entirely based on using OpenOffice 2.4. The current version is 3.0, which I just downloaded.

TAMALg: is the package available?

I’ve received a lot of questions recently about TAMALg availability. Unfortunately, there is only a difficult-to-install package available right now; I sent it to someone recently and they had a terrible time getting it going.

I do describe the algorithm in the supplementary materials to the ENCODE spike-in competition paper (Johnson et al, Genome Research 2008).

I would love to have a simple package to distribute, but there is little support for this in today’s granting environment; in fact, I don’t think that making algorithms widely available has ever been well supported by any US funding agency. And I doubt the situation is different here in Canada.

I may be getting another undergrad soon and would task that person with working on the package. As a new faculty member, I am simply overwhelmed with basics like getting my lab going right now.

I do hope that this situation changes and thanks to all for patience.

As I have noted previously, the L2L3combo predictions produced by the TAMALPAIS server (see previous posts on this or just search for “TAMALPAIS Bieda” – no quotes, though) are the same predictions as those made by TAMALg. TAMALg also adds a step of estimating enrichment using a maxfour-type methodology.

So you can get good TAMALg predictions of sites just by using the webserver. I suggest going this route.

And to repeat – TAMALg is almost certainly NOT what you want for promoter arrays. The exceptions: a factor present in only a tiny fraction of promoters, or one of the newer designs with very long promoter regions (e.g. for 10 kb promoters, it might be ok).

Python and Bioinformatics and Perl: Chomp in python

Update: As many readers have commented, I have just missed the obvious – there are functions in python to do this. See comments section for details.

So I do a lot of file processing in my bioinformatics work and I’ve always really liked the perl function chomp.

I wanted to implement something in python to do this, something that, like the perl one, is able to handle multiple line endings (that is, Linux, Windows, and Mac line endings).

So this is chomp in python: in a sense a def chomp, but I renamed it chomppy.

IMPORTANT: I am not guaranteeing in any way that this completely replicates chomp behavior. And, of course, this won’t work on more unusual systems that have different line ending conventions. In my work, I use UNIX/Linux, windows, and older mac stuff – so this works for those. And it handles ugly cases well, as you can see.

Enjoy! and comments welcome.

Also, this is not beautiful code! I threw this together because I was frustrated.

NOTE: you will have to adjust the tab spacing for the function to work; but you already know this… copying to HTML can be a pain…

>>> def chomppy(k):
    if k=="": return ""
    if k=="\n" or k=="\r\n" or k=="\r": return ""
    if len(k)==1: return k #single character that is not a line ending (per above case)
    if len(k)==2 and (k[-1]=='\n' or k[-1]=='\r'): return k[0]
    #done with weird cases, now deal with average case
    lastend=k[-2:] #get last two characters
    if lastend=='\r\n':
        return k[:-2] #strip Windows-style ending
    elif lastend[1]=="\n" or lastend[1]=="\r":
        return k[:-1] #strip Linux (\n) or old Mac (\r) ending
    return k

>>> chomppy('cow\n')
'cow'
>>> chomppy('')
''
>>> chomppy('hat')
'hat'
>>> chomppy('cat\r\n')
'cat'
>>> chomppy('\n')
''
>>> chomppy('\r\n')
''
>>> chomppy('cat\r')
'cat'
>>> chomppy('\r')
''
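As the update at the top of this post says, the built-ins already cover this. A minimal sketch: `rstrip` with an argument strips all trailing `\r` and `\n` characters, so it matches the hand-rolled version whenever there is at most one trailing line ending (it behaves differently on strings like 'cat\n\n', where perl's chomp would remove only one newline).

```python
def chomp_builtin(k):
    # strips ALL trailing '\r' and '\n' characters, in any combination;
    # this matches the hand-rolled chomppy when there is at most one
    # trailing line ending
    return k.rstrip("\r\n")

# splitlines() is the other built-in route when processing whole files:
# it splits on \n, \r\n, and \r and drops the endings in one step
print(chomp_builtin("cow\n"))
print("a\r\nb\rc\n".splitlines())
```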

Viewing Large Text Files (like big GFF files) in Windows

I know, I know, many of you will say “just use Linux”. And this is true, but SignalMap from NimbleGen, which is quite convenient for viewing GFF files of ChIP-chip data, is just a windows product (and yes, I did try WINE as an emulator).

So if you try to load a 380,000 line file into Notepad (or even a much smaller file), Notepad will blow up. And even WordPad behaves badly.

The good – no, great – free windows solution is Notepad++. Available here.
I’ve used it for a few years; it works great. Will easily load multiple 380,000 line files (like 40 Mb GFF files).

Notepad++ also fulfills other requirements for me: it clearly has a large-ish user base and is constantly being updated/upgraded. So it is a robust, free product.

Other good bits:
(1) will convert from Windows to Mac to Unix line endings
(2) automatically recognizes the line-ending type (important for looking at files)
(3) very good syntax highlighting for a wide variety of programming languages
(4) tabbed files means that you can easily switch from file to file
(5) it retains memory of your open files – so they will be there each time you open it
(6) good behavior when you move/change a file that you are editing – will ask you to reload/save/etc.

Some not-so-good bits:
(1) for big files, regular expression stuff is slow

Mark Bieda windows big files viewing gff files NimbleGen tutorial howto notepad problems

Linux Installation on HP Pavilion Desktop (June 2008 purchase)

Mark Bieda HP Linux install installation

This is just a brief post about my (read: my student’s) experience with installing linux on a new HP Pavilion. This is a standard model available at Futureshop and BestBuy: intel quad-core Q6600 processor, 640 GB hard disk, 3 GB RAM. Nice machine, only $899 here in Canada (sure to be cheaper in the USA).

So I’ve installed linux on several laptops and desktops, including Mandriva, Red Hat, Fedora, Suse. And of course I have run Knoppix and, as indicated in an earlier post, have been using DSL (Damn Small Linux) under VMPlayer for a while now.

So this time, let the undergrad do it!

Here are the notes:
(1) this computer had Windows Vista on it. Home Premium edition. We wanted to keep windows, not because I love windows, but because I have some key software that only runs on windows (e.g. NimbleGen SignalMap for looking at data).
(2) Installation of OpenSuse 10.3 caused a conflict with the windows system which led to a restore operation (nothing was lost, no big deal). So we dropped working on this one – and went to working on Ubuntu 8.04 LTS.
(3) The big problem was that the ethernet card, built into the motherboard, has known problems with talking to current linux distros. The joy of a new computer!
(4) Ubuntu installed well except for the ethernet card deal, which is a big problem.
(5) To solve the ethernet card problem, we just ended up buying a new card for the computer; it was only $19.76 at our friendly University of Calgary MicroIT store. The model is a “Gigabit Ethernet PCI Card”; the model number appears to be ST1000BT32. This solved the problem, although MFU (My Friendly Undergrad) had to disable the onboard ethernet in the BIOS (which was not deadly, but led to one of those long pauses in bootup).

The Results
Everything seems to run very well. The computer is happy, it talks to the internet (from both windows and linux) and, as usual, everything runs just a bit (or a lot, depending) faster on the linux side vs the windows side.

I am a longtime KDE user, and I really like KDE in this distribution (downloaded and installed as packages in Ubuntu). I guess it is technically Kubuntu, but like I said, the undergrad was doing the installation so… I got to skip on thinking about this stuff.

Sqlite (Sqlite3) quick tips: if you know SQL already

Mark Bieda SQL Sqlite Sqlite3

I’m a long-time MySQL user, but recently I’ve been using sqlite (sqlite3).
This is a sqlite tutorial, in a sense, if you know SQL.

As with my other stuff, this is based on my real experience of using this system

Why use sqlite?
The basic thing is that it installs super fast (unbelievably, you just download a .exe file for windows and run it). This is in contrast to the big MySQL model. You get to skip all that client-server business (which is really important in many cases, but not for most stuff that I do).

installation and getting started
1. download and (on windows) just place the .exe somewhere. I like to place it in C:\sqlite3\
2. (windows) At the Start button, click Run and enter cmd. In the command window, go to C:\sqlite3 and run
sqlite3 temp.db

Critical stuff to know
.help — gives the list of dot commands. Important and useful
.separator "," — means to separate input and output columns (fields) by commas
.separator "\t" — same but with tabs
(important) – you have to set the separator before attempting to load data from a file into the database
.output myresults.txt — starts directing all query (like SELECT statements) output to myresults.txt
.output stdout — starts directing all query (like SELECT statements) output to stdout; will close any previous output file
.import gooddata.csv mytable — imports data from gooddata.csv to mytable using the current separator value to separate fields
.tables — a list of the tables in the database
.databases — a list of the databases
.schema mytable — statements used to create mytable; will also list indexes (useful!)

Control of Sqlite3:
Ctrl-c — ends Sqlite3
; — a semicolon must be used to end each SQL statement (dot commands do not need one)

A typical session
Note: I “made up” this session, so there could be a few small bugs…
create table mytable (idnum varchar(20), salary float, age int);
.separator "\t"
.import persondata.txt mytable
create index idex on mytable(idnum);
select * from mytable where age<30;
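For reference, the same session can be driven from Python’s built-in sqlite3 module (shipped with Python since 2.5), which skips the separator/dot-command business entirely. This is just a sketch: the file name, table, and column values below are the made-up ones from the session above.

```python
import csv
import sqlite3

# a tiny made-up persondata.txt, tab-separated (idnum, salary, age)
with open("persondata.txt", "w") as f:
    f.write("p001\t50000\t25\np002\t62000\t41\n")

conn = sqlite3.connect(":memory:")  # or "temp.db" for an on-disk database
cur = conn.cursor()
cur.execute("create table mytable (idnum varchar(20), salary float, age int)")

# the .import step: read the tab-separated file and insert row by row
with open("persondata.txt") as f:
    cur.executemany("insert into mytable values (?, ?, ?)",
                    csv.reader(f, delimiter="\t"))

cur.execute("create index idex on mytable(idnum)")
rows = cur.execute("select * from mytable where age < 30").fetchall()
print(rows)  # only p001 is under 30
conn.close()
```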

How I use sqlite3:
I know SQL “by heart”, so it is pretty easy for me to do things quickly with files, especially when I have to correlate values in files. Sometimes I reformat files in bash, perl, or more recently, Python.

Note that Python’s built-in set type (a built-in since version 2.4) gives really good database-like behavior. And sets are fast, in my experience.
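As a small illustration of that database-like behavior (the probe IDs here are made up): intersecting two ID lists, the kind of correlate-values-across-files job mentioned above, is one line with sets.

```python
# made-up example: probe IDs seen in two experiments
ids_exp1 = set(["chr1_100", "chr1_250", "chr2_900"])
ids_exp2 = set(["chr1_250", "chr2_900", "chr3_412"])

shared = ids_exp1 & ids_exp2        # like an SQL inner join on the ID
only_in_exp1 = ids_exp1 - ids_exp2  # like a NOT IN subquery
print(sorted(shared))
print(sorted(only_in_exp1))
```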