Key Bioinformatics Computer Skills

I’ve been asked several times which computer skills are critical for bioinformatics. Note that I am addressing only the “computer skills” side of things here. This is my list of what it takes to be a functional, comfortable bioinformatician.

  1. SQL and knowledge of databases. I always recommend that people start with MySQL, because it is crossplatform, very popular, and extremely well developed.
  2. Perl or Python. Preferably perl. It kills me to write this, because I like python so much more than perl, but from a “getting the most useful skills” perspective, I think you have to choose perl.
  3. Basic Linux. Actually, being at a semi-sys admin level is even better. I always tell people to go “cold turkey” and just install Linux on their computer and commit to using it exclusively for a while. (Thanks to OpenOffice etc., this should be mostly doable these days.) This will force a person to get comfortable. Learning to use a Mac from the command line is an ok second option, as is Solaris etc. Still, I’d have to say Linux would be preferred.
  4. Basic bash shell scripting. There are still too many cases where this ends up being “just the thing to do”. And of course, all of this applies to the Mac command line as well.
  5. Some experience with Java or other “traditional languages” or a real understanding of modern programming paradigms. This may seem lame or vague. But it is important to understand how traditional programming languages approach problems. At minimum, this ensures some exposure to concepts like object-oriented programming, functional programming, libraries, etc. I know that one can get all of this with python and, yes, even perl – but I fear that many bioinformatics people get away without knowing these things, to their detriment.
  6. R + Bioconductor. So many great packages in Bioconductor. Comfort with R can solve a lot of problems quickly. R is only growing; if I could buy stock in R, I would!

This may seem like a lot, but many of these items fit together very well. For example, one could go “cold turkey” and just use Linux and commit to doing bioinformatics by using a combination of R, perl and shell scripting, and an SQL-based database (MySQL). It is very common in bioinformatics to link these pieces, so… not so bad, in the end, I think.

As always, comments welcome…

Free, easy, quick, great PDF creation: Try OpenOffice

keywords: free software, opensource, OpenOffice, grantwriting

I try to give credit where credit is due.

I have written before about using OpenOffice (version 2.4) for “real professional work.” In an earlier post, I wrote about successfully writing an entire grant application using OpenOffice for word processing and figure creation in conjunction with Zotero for references (and the grant was funded, so…).

PDF creation from OpenOffice (use “Export to PDF” in the File menu) simply works great. It is very fast and the pdf quality is excellent. One note – it does not open the pdf automatically – it just stores the file – so pay attention to this. This works much better than printing to a pdf using the Adobe PDF printer or using the Microsoft Office 2007 export to pdf functions (which, besides being slow, caused Microsoft Office to crash occasionally on my machine).

Also, before I forget, I really like OpenOffice Draw for scientific figure creation – I use it a lot in my work and I have been quite happy with it. I’m using Microsoft Office a fair amount now, but I still use draw to make figures. I’ve used Zotero and Draw for well over a year now, with fairly intense use.

Note: This is almost entirely based on using OpenOffice 2.4. The current version is 3.0, which I just downloaded.

Viewing Large Text Files (like big GFF files) in Windows

I know, I know, many of you will say “just use Linux”. And this is true, but SignalMap from NimbleGen, which is quite convenient for viewing GFF files of ChIP-chip data, is just a windows product (and yes, I did try WINE as an emulator).

So if you try to load a 380,000 line file into Notepad (or even a much smaller file), Notepad will blow up. And even WordPad behaves badly.

The good – no, great – free windows solution is Notepad++. Available here.
I’ve used it for a few years; it works great. Will easily load multiple 380,000 line files (like 40 Mb GFF files).
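When you only need a quick look at a big GFF file (the first few lines, or a line count), a few lines of Python will do it without loading the whole file into memory. This is only a minimal sketch, written for a current Python 3 interpreter; the function names `peek` and `count_lines` are my own and not part of any library.

```python
import itertools


def peek(path, n=10):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, "r") as handle:
        # islice stops after n lines, so a 40 MB file costs almost nothing.
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]


def count_lines(path):
    """Count lines by streaming through the file once."""
    with open(path, "rb") as handle:
        # Iterating over a file object yields one line at a time.
        return sum(1 for _ in handle)
```

For example, `peek("mydata.gff", 5)` (with a hypothetical file name) returns the first five lines as a list of strings, which is usually enough to check the column layout.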

Notepad++ also fulfills other requirements for me: it clearly has a large-ish user base and is constantly being updated/upgraded. So it is a robust, free product.

Other good bits:
(1) will convert from Windows to Mac to Unix line endings
(2) automatically recognizes the line-ending type (important for looking at files)
(3) very good syntax highlighting for a wide variety of programming languages
(4) tabbed files means that you can easily switch from file to file
(5) it retains memory of your open files – so they will be there each time you open it
(6) good behavior when you move/change a file that you are editing – will ask you to reload/save/etc.
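The line-ending points in (1) and (2) are easy to reproduce in a script when you need to check or fix many files at once. The sketch below is a rough illustration in Python 3, not a description of how Notepad++ does it. The function names are made up, and I am assuming that “Mac” means the old carriage-return-only convention.

```python
def detect_line_ending(data: bytes) -> str:
    """Guess the dominant line-ending style of some raw file contents."""
    crlf = data.count(b"\r\n")        # Windows: carriage return + line feed
    lf = data.count(b"\n") - crlf     # Unix: bare line feed
    cr = data.count(b"\r") - crlf     # old Mac (assumed): bare carriage return
    counts = {"windows": crlf, "unix": lf, "old-mac": cr}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "none"


def to_unix(data: bytes) -> bytes:
    """Convert Windows and old-Mac line endings to Unix line feeds."""
    # Replace CRLF first so that each pair becomes a single line feed.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Read the file in binary mode (`open(path, "rb")`) before calling these, because text mode translates line endings on the way in.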

Some not-so-good bits:
(1) for big files, regular expression stuff is slow

Mark Bieda windows big files viewing gff files NimbleGen tutorial howto notepad problems

Free Multiplatform Reference Management? Try Zotero

Mark Bieda zotero references computer software citations

You use Endnote, refman, or one of the others. You want a free alternative because (1) you don’t want to worry about licensing issues (like buying a new copy for each computer) (2) you want something that will run under windows, linux, and mac os x (3) you just don’t want to pay or (4) you want to move your references from place to place without having to adapt to the local software choice (i.e. some places will have Endnote, others will have RefMan, others will have other solutions) or (5) you just believe stuff like this should be free.

So: I have been using Zotero for over a year. Zotero is great for everyday web stuff, but here I will just talk about it as a reference manager.

As with my other software comments, this is based on my real experience. I recently wrote an entire grant using Zotero as my only reference manager. And it worked well.

A key thing:
Zotero is heavily and institutionally supported (see the webpage). From the forum comments, you can see that many users are in academe. So it should only get better.

Problems/Weaknesses:
(1) This is clearly still in development. But, as I said, I wrote a grant with it – and it worked well for me, but it is not as smooth as EndNote in many ways.
(2) There are a limited number of citation styles, but this number is growing – and you can define your own. For things like grants, usually you get to choose a style. For a typical paper, you won’t have a large number of references, so a little manual editing will get you through. Still, because of this, EndNote really still has a big edge.

Getting it:
(1) Zotero is a firefox extension and, when you go to the site, seems more geared toward web-based research.
(2) Installation is superfast and easy. Firefox is the way to go. No internet explorer version.
(3) You will also need to download plug-ins for either Microsoft Word or Openoffice Writer. I used OpenOffice Writer for my grant.

Basics:
(1) There is a tutorial on the website, unfortunately oriented mostly toward MS Word usage. The same rules apply to OpenOffice Writer.
(2) If you are using OpenOffice Writer, here is something to be careful with: don’t save your files in .doc (MS Word) format. I usually do, because I need to send files to colleagues, all of whom have MS Word but not OpenOffice Writer. If you do this, you will lose the ability to handle your citations.

Getting going:
download and install Zotero from the Zotero website
download and install the appropriate word processing plugin

To get citations:
(1) you can import from many, many sites – notably PubMed. You just click on a button when you find something you like and it gets imported into Zotero.

Recommendations:
(1) When I last looked (about April, 2008), the documentation for Zotero was generally very good, but the documentation for the citation/reference aspects was very poor. So I strongly suggest that you download a few references and play with a pretend, test document to get a sense of how Zotero works and what results you get. I did this and it really helped me use it. Only took a few minutes of playing around.

Linux Installation on HP Pavilion Desktop (June 2008 purchase)

Mark Bieda HP Linux install installation

This is just a brief post about my (read: my student’s) experience with installing linux on a new HP Pavilion. This is a standard model available at Futureshop and BestBuy: Intel quad-core Q6600 processor, 640 GB hard disk, 3 GB RAM. Nice machine, only $899 here in Canada (sure to be cheaper in the USA).

So I’ve installed linux on several laptops and desktops, including Mandriva, Red Hat, Fedora, Suse. And of course I have run Knoppix and, as indicated in an earlier post, have been using DSL (Damn Small Linux) under VMPlayer for a while now.

So this time, let the undergrad do it!

Here are the notes:
(1) this computer had Windows Vista on it. Home Premium edition. We wanted to keep windows, not because I love windows, but because I have some key software that only runs on windows (e.g. NimbleGen SignalMap for looking at data).
(2) Installation of OpenSuse 10.3 caused a conflict with the windows system which led to a restore operation (nothing was lost, no big deal). So we dropped working on this one – and went to working on Ubuntu 8.04 LTS.
(3) The big problem was that the ethernet card, built into the motherboard, has known problems with talking to current linux distros. The joy of a new computer!
(4) Ubuntu installed well except for the ethernet card deal, which is a big problem.
(5) To solve the ethernet card problem, we just ended up buying a new card for the computer – it was only $19.76 at our friendly University of Calgary MicroIT store. Model is “Gigabit Ethernet PCI Card” from startech.com. The model number appears to be ST1000BT32. This solved the problem, although MFU (My Friendly Undergrad) had to do something to disable the BIOS from trying to connect to the one in the motherboard (which was not deadly, but led to one of those long pauses in bootup).

The Results
Everything seems to run very well. The computer is happy, it talks to the internet (from both windows and linux) and, as usual, everything runs just a bit (or a lot, depending) faster on the linux side vs the windows side.

On KDE
I am a longtime KDE user, and I really like KDE in this distribution (downloaded and installed as packages in Ubuntu). I guess it is technically Kubuntu, but like I said, the undergrad was doing the installation so… I got to skip on thinking about this stuff.

Python for Perl Programmers (and Bioinformatics people)

Mark Bieda python getting started quick tips hints tutorial

I wanted to write a short post about getting started in python.

What you will like about Python as a perl person:
(1) A great thing is the interpreter. This will allow really rapid learning of python. For a perl person, python should come really fast. I was very, very surprised at how quickly I was writing actually useful (not toy) programs to manipulate things.
(2) It is easy to install in windows and has a decent editor/run environment (IDLE). Python is now a standard part of Linux distros, except for the smallest ones (perl is everywhere, so an advantage to perl here, but only a small one).

Some key things:
(1) The online manuals for python are good (but maybe not great). The Guido tutorial is key; make sure that you get the latest one.
(2) If you like to have a book on python around (I always do for my programming language du jour), then make sure that you have the most recent one.
(3) Why the emphasis on the most recent? Python has added key new features in recent times – like even since version 2.4! So make sure that you have the latest documentation.

Installation and Usage:
(1) For windows people, use the IDLE editor. Really. You will find it very easy to use and efficient. It comes in the download, so no installation deal.
(2) To learn python really fast, just play with commands in the interpreter window. It really is easy and efficient – a very quick way to get up to speed on things.

Some key things for bioinformatics people, in particular:
(1) Sets. Sets are very nice. Intersection, union… all that stuff that you want to use.
(2) A lot of string manipulation functions (actually methods, technically) are available. These will do a lot of what you would do with regular expressions, but see the next point.
(3) Unfortunately, regular expressions are not built into the language syntax; they live in a separate (but standard library) module and are a bit different from perl in usage/implementation.
(4) Like perl, the built-in sorting in python is weird (and annoying to set up for anything beyond a simple sort), but very useful. Again, here, make sure that you look at the latest documentation.
(5) Sqlite library is now part of the standard package. I haven’t used it yet as part of python – but given that this is a standard part of the distribution, it seems like I could write code that uses it and not worry about portability issues. This is well worth looking at for bioinformatics people.
(6) Remember that tuples are unchangeable (immutable) and lists are changeable. So far, this has led me to be pretty list-oriented, but I am new to this.
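To make the list above concrete, here is a small sketch of sets, string methods, regular expressions, sorting, and the tuple/list difference. It is written for a current Python 3 interpreter (newer than the version this post was written against), and every gene name, record, and value in it is made up purely for illustration.

```python
import re

# Hypothetical gene lists from two experiments (made-up identifiers).
chip_hits = {"geneA", "geneB", "geneC"}
expr_hits = {"geneB", "geneC", "geneD"}

shared = chip_hits & expr_hits       # intersection
either = chip_hits | expr_hits       # union
chip_only = chip_hits - expr_hits    # difference

# String methods cover a lot of what perl would do with a regex.
line = "chr1\tsource\tpeak\t100\t250\t0.93"
fields = line.rstrip("\n").split("\t")
is_chr = fields[0].startswith("chr")

# Regular expressions live in the standard-library "re" module.
match = re.search(r"^chr(\w+)$", fields[0])
chrom = match.group(1) if match else None

# Sorting beyond the simple case: pass a key function.
records = [("chr1", 500), ("chr1", 20), ("chr2", 7)]
by_start = sorted(records, key=lambda rec: rec[1])

# Lists can be changed in place; tuples cannot.
positions = [100, 250]
positions.append(400)
```

The `key=` argument is what takes most of the annoyance out of sorting: you describe what to sort on (here, the second element of each record) rather than writing a comparison routine.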

I’ll leave it at that for now. I’ll write more about python later on.

I wish I had… started with python earlier…

So far, my bioinformatics work has used a melange of perl, R, and bash scripting. While this has worked pretty well, it does have limits. For one, it is not very portable (the bash scripting in particular). I’ve already had problems with distributing software.

I wanted something that I could distribute in an easier way, yet had the advantages of perl. I found Jython, which is Python-in-Java. For me, the big deal is not use of Java libraries, but rather that the language would compile to Java byte-code and hence would be easy to distribute.

But I found that Python is much more than this: the interactive environment, for one, makes me ok with not having my unix/linux toolbox when I am stuck on the windows side.

And Python has a lot of nice features for bioinformatics work, including convenient types like sets (as of version 2.4) and even comes with sqlite (which I have not used from python, but want to)…
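For anyone curious what using sqlite from python looks like, here is a minimal sketch with the standard-library `sqlite3` module, run on a current Python 3 interpreter. The table name, column names, and values are invented for illustration, and the database lives in memory so nothing is written to disk.

```python
import sqlite3

# An in-memory database: nothing to install, nothing left behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE peaks (chrom TEXT, start INTEGER, stop INTEGER, score REAL)"
)

# Made-up ChIP-chip style records.
rows = [
    ("chr1", 100, 250, 0.93),
    ("chr1", 900, 1200, 0.41),
    ("chr2", 50, 75, 0.88),
]
# The ? placeholders let sqlite3 handle quoting of the values.
conn.executemany("INSERT INTO peaks VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Ordinary SQL from here on: keep only the high-scoring peaks.
strong = conn.execute(
    "SELECT chrom, start FROM peaks WHERE score > ? ORDER BY chrom, start",
    (0.8,),
).fetchall()
conn.close()
```

Passing a file name instead of `":memory:"` to `sqlite3.connect` stores the database in a single file, which is the part that matters for distributing a tool without asking users to set up a database server.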

Anyways, for now, I am a fan.