Python and Bioinformatics and Perl: Chomp in python

Update: As many readers have commented, I have just missed the obvious – there are functions in python to do this. See comments section for details.

So I do a lot of file processing in my bioinformatics work and I’ve always really liked the perl function chomp.

I wanted to implement something in python to do this, and something that is able to handle multiple line endings (that is, Linux, Windows, and Mac line endings). Note that perl’s own chomp only strips the current input record separator ($/), which is a newline by default.

So this is chomp in python: in a sense a def chomp, but I renamed it to chomppy.

IMPORTANT: I am not guaranteeing in any way that this completely replicates chomp behavior. And, of course, this won’t work on more unusual systems that have different line ending conventions. In my work, I use UNIX/Linux, windows, and older mac stuff – so this works for those. And it handles ugly cases well, as you can see.

Enjoy! and comments welcome.

Also, this is not beautiful code! I threw this together because I was frustrated.

NOTE: you will have to adjust the tab spacing for the function to work; but you already know this… copying to HTML can be a pain…

>>> def chomppy(k):
    # strip one trailing line ending: \r\n (Windows), \n (Linux) or \r (old Mac)
    if k.endswith("\r\n"):  # check the two-character ending first
        return k[:-2]
    if k.endswith("\n") or k.endswith("\r"):
        return k[:-1]
    return k  # no line ending found (this also covers the empty string)

>>> chomppy('cow\n')
'cow'
>>> chomppy('')
''
>>> chomppy('hat')
'hat'
>>> chomppy('cat\r\n')
'cat'
>>> chomppy('\n')
''
>>> chomppy('\r\n')
''
>>> chomppy('cat\r')
'cat'
>>> chomppy('\r')
''
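
As the update at the top says, python already has functions that cover this. Here is a minimal sketch of the built-in route, written for a current Python 3. The names chomp_all and chomp_once are just mine for illustration. One thing to watch: rstrip removes every trailing \r and \n, not just one line ending, so it is not an exact match for chomppy.

```python
def chomp_all(line):
    # str.rstrip with an argument strips ANY run of the listed characters
    # from the right, so "cat\n\n" loses both newlines.
    return line.rstrip("\r\n")


def chomp_once(line):
    # Remove at most one trailing line ending, like chomppy does.
    # The two-character Windows ending has to be tested first.
    for ending in ("\r\n", "\n", "\r"):
        if line.endswith(ending):
            return line[:-len(ending)]
    return line


# For a whole block of text, str.splitlines splits on all three
# conventions and drops the line endings in one go.
lines = "cow\ncat\r\nhat\r".splitlines()
```

For reading files line by line, line.rstrip("\r\n") is probably what you want most of the time; chomp_once is only needed if a line can legitimately end in extra blank-line characters that you want to keep.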

Python for Perl Programmers (and Bioinformatics people)

keywords: Mark Bieda, python, getting started, quick tips, hints, tutorial

I wanted to write a short post about getting started in python.

What you will like about Python as a perl person:
(1) A great thing is the interpreter. This will allow really rapid learning of python. For a perl person, python should come really fast. I was very, very surprised at how quickly I was writing actually useful (not toy) programs to manipulate things.
(2) It is easy to install in windows and has a decent editor/run environment (IDLE). Python is now a standard part of Linux distros, except for the smallest ones (perl is everywhere, so an advantage to perl here, but only a small one).

Some key things:
(1) The online manuals for python are good (but maybe not great). The Guido tutorial is key; make sure that you get the latest one.
(2) If you like to have a book on python around (I always do for my programming language du jour), then make sure that you have the most recent one.
(3) Why the emphasis on the most recent? Python has added key new features in recent times – like even since version 2.4! So make sure that you have the latest documentation.

Installation and Usage:
(1) For windows people, use the IDLE editor. Really. You will find it very easy to use and efficient. It comes in the download, so no installation deal.
(2) To learn python really fast, just play with commands in the interpreter window. It really is easy and efficient – a very quick way to get up to speed on things.

Some key things for bioinformatics people, in particular:
(1) Sets. Sets are very nice. Intersection, union… all that stuff that you want to use.
(2) A lot of string manipulation functions (actually methods, technically) are available. These will do a lot of what you would do with regular expressions, but see the next point.
(3) Unfortunately, regular expressions are in a separate (but standard library) module, re, and are a bit different from perl in usage/implementation.
(4) Like perl, the built-in sorting in python is weird (and annoying to set up to do anything beyond simple), but very useful. Again, here, make sure that you look at the latest documentation.
(5) The sqlite library (the sqlite3 module) is now part of the standard package. I haven’t used it yet as part of python – but given that this is a standard part of the distribution, it seems like I could write code that uses it and not worry about portability issues. This is well worth looking at for bioinformatics people.
(6) Remember that tuples are unchangeable (immutable) and lists are changeable. So far, this has led me to be pretty list-oriented, but I am new to this.
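
To make the list above concrete, here is a small sketch that touches each point once. It is written for a current Python 3 (the set literal syntax is newer than the versions discussed here), and the gene names, coordinates and table layout are made up purely for illustration.

```python
import re
import sqlite3

# (1) Sets: hypothetical lists of genes bound in two experiments.
bound_a = {"MYC", "E2F1", "TP53"}
bound_b = {"E2F1", "TP53", "GATA1"}
shared = bound_a & bound_b   # intersection
either = bound_a | bound_b   # union

# (2) String methods do a lot of what you would use a regex for in perl.
fields = "chr1\t1000\t2000\tMYC\n".rstrip("\n").split("\t")

# (3) Regular expressions come from the standard-library re module.
match = re.match(r"chr(\w+):(\d+)-(\d+)", "chr1:1000-2000")
chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))

# (4) Sorting beyond the simple case: pass a key function.
peaks = [("chr2", 500), ("chr1", 900), ("chr1", 100)]
by_position = sorted(peaks, key=lambda peak: (peak[0], peak[1]))

# (5) sqlite3 ships with python; ":memory:" keeps the database off the disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE peaks (chrom TEXT, pos INTEGER)")
conn.executemany("INSERT INTO peaks VALUES (?, ?)", peaks)
query = "SELECT COUNT(*) FROM peaks WHERE chrom = ?"
chr1_count = conn.execute(query, ("chr1",)).fetchone()[0]
conn.close()

# (6) Tuples are immutable, lists are not.
peak_as_list = list(peaks[0])
peak_as_list[1] = 600   # fine for a list; peaks[0][1] = 600 raises TypeError
```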

I’ll leave it at that for now. I’ll write more about python later on.

I wish I had… started with python earlier…

So far, my bioinformatics work has used a melange of perl, R, and bash scripting. While this has worked pretty well, it does have limits. For one, it is not very portable (the bash scripting in particular). I’ve already had problems with distributing software.

I wanted something that I could distribute in an easier way, yet had the advantages of perl. I found Jython, which is Python-in-Java. For me, the big deal is not use of Java libraries, but rather that the language would compile to Java byte-code and hence would be easy to distribute.

But I found that Python is much more than this: the interactive environment, for one, makes me ok with not having my unix/linux toolbox when I am stuck on the windows side.

And Python has a lot of nice features for bioinformatics work, including convenient types like sets (as of version 2.4) and even comes with sqlite (which I have not used from python, but want to)…

Anyways, for now, I am a fan.

TAMALPAIS and promoter arrays

keywords: TAMALPAIS, NimbleGen, promoter arrays, array analysis, problems, Mark Bieda

I’ve been receiving some questions on TAMALPAIS usage for promoter arrays via email.

On the TAMALPAIS website, I say “Do not use this for promoter arrays.”

This is actually not quite true; there are a limited number of cases in which TAMALPAIS will perform well for promoter arrays. In this post, I discuss this.

When TAMALPAIS is ok for promoter arrays:
In short:
1. If your factor only binds to a tiny portion of the promoters (<5%), then TAMALPAIS will perform ok.
2. More correctly, and more importantly: if only a small number of probes on the array are within binding sites for your factor, then you are ok. So for promoter array designs with long promoters, you might have 15% of the promoters with a binding site, but still only a small fraction of the probes within binding sites. (Hopefully this makes sense.)
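
The arithmetic behind point 2 can be sketched like this. Every number below except the 15% is invented for illustration (probes per promoter, probes per binding site), and the function assumes at most one binding site per bound promoter, which is a simplification of mine and not something TAMALPAIS itself computes.

```python
def fraction_of_probes_bound(fraction_promoters_bound,
                             probes_per_promoter,
                             probes_per_site):
    # Rough fraction of ALL probes on the array that sit inside a
    # binding site, assuming one site per bound promoter.
    return fraction_promoters_bound * probes_per_site / probes_per_promoter


# Hypothetical short-promoter design: 15 probes per promoter.
short_design = fraction_of_probes_bound(0.15, 15, 5)    # about 0.05

# Hypothetical long-promoter design: 100 probes per promoter.
long_design = fraction_of_probes_bound(0.15, 100, 5)    # under 0.01
```

With the same 15% of promoters bound, the long-promoter design has a much smaller share of its probes inside binding sites, which is the situation described in point 2.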

Why do I say “Do not use TAMALPAIS for promoter arrays”?
If you have a factor that binds to (or exists in) a lot of promoter regions – like POLII or some histone modifications – then TAMALPAIS will give you bad results. I don’t want that to happen. Right now, the study of histone mods and POLII is a big deal, so I don’t want people to be unhappy.

If not TAMALPAIS, then what?
There are a number of options. I developed maxfour to score promoters (see Krig et al. 2007 in JBC). I will be releasing an easy-to-use version of this software by fall 2008 (planned, not a promise). This is really the best option with NimbleGen’s current crop of designs, in my opinion. Someone else may have some great promoter array analysis software; I’m not aware of any right now – feel free to email me or leave comments. I don’t mean to be unfair to other bioinformaticians with this.

What about the promoter array analysis server?
Ah, yes. This does very limited analysis – see my post on it in this blog (click the promoter array category button on the sidepanel).

TAMALg and TAMALPAIS: NimbleGen data analysis

Ok, I wanted to write about the relationship between TAMALPAIS and TAMALg.

keywords: Mark Bieda, TAMALPAIS, TAMALg, NimbleGen, ChIP-chip

Background
A major part of my research is developing algorithms and statistical models for analysis of ChIP-chip experiments – specifically those done with NimbleGen arrays.
TAMALPAIS (available here) predicts binding sites from NimbleGen array data and also does some basic secondary analyses like localization of binding sites in reference to transcription start sites and which genes have a binding site in the proximal promoter. The website version gives a lot of output.
TAMALg (TAMALpais generalized) recently was ranked #1 in an unbiased competition between algorithms. It uses the same exact prediction approach as TAMALPAIS (technically, it uses the L2L3combo set of predictions – to get these predictions, go to the TAMALPAIS website here). Then, in a second step, it uses the maxfour approach that I developed for promoter arrays (Krig et al., 2007 in JBC) to predict the actual amount of enrichment per binding site.

So the relationship between TAMALPAIS and TAMALg is this:
TAMALPAIS produces the same high-quality peak predictions as TAMALg (and I say high quality because the competition showed this; see this paper abstract). But TAMALPAIS does not do the enrichment prediction. Remember to look at the L2L3combo set from TAMALPAIS to get the same predictions as TAMALg.

Future Stuff
I am planning on producing a downloadable version of TAMALg (probably Jython-based so that it will easily run on many platforms).

Remember! TAMALPAIS and TAMALg are not good for most promoter arrays!

If you have questions, you should contact me (see About tab on this site for contact info).

NCBI GEO submission: howto hints

Ok, NCBI GEO submission of data can be a pain. I mean a big pain.
But there are a few simple things that can make it less painful.

Here are my hints and a few steps:

1. Don’t assume that you will get the submission right the first time; it’s easy to have errors.
2. DO assume that NCBI will contact you requesting more information on some things. Be ready.
3. DO save all relevant files; as #2 says, you may get contacted.

And importantly:
4. Remember: some of the annoyance of the system is to ensure that in 5 years… or 10 years, your data will still be comprehensible. As opposed to having it in some weird vendor-specific format… So be patient.
5. Put that you did NCBI GEO submission on your resume. It can’t hurt.

Key hints for making it easier
1. Do the whole submission while the people who generated the data are around. You will be surprised at the little things that you need to add and that turn out to be unclear.
2. You will need all the files for the experiments – you have to put raw files in as a supplement. So get the files together as much as possible.

The Steps: A Protocol
1. Search GEO for an entry that has the exact same type of data/type of array that you are submitting. This will save you huge amounts of time. You don’t want to have to redefine a platform file – it is annoying, it will just cost you time and energy, and it makes the system worse.
2. After finding that file, you will have the platform file (the GPL file number) for the array type that you are using. Make a clear note of this!
3. (Note: there may be better ways to do this, but this works for me) Download the sample file that you found in SOFT format in full. The SOFT format makes uploading files way faster and easier.
4. The SOFT format is a plain-text format and the opening lines are clearly labeled fields. Open the file in a text editor (note: for windows, download and install Notepad++ to do this; it will save you a lot of pain).
5. Cut away the header (maybe 30 or 50 lines) and make a new file. Edit this file with the parameters of your experiment.
6. The hard part is this: you have to make a data file that corresponds to the platform file IDs. This is beyond the scope of this blog post; maybe I will add something about this later.
7. Make a zip file of all the supplementary files (these are the raw data files). I’ll call this SUPP.zip
8. Edit the header to reflect that you are putting in a supplementary file and add the name of this file.
9. Add your header to the datafile (made in step #6). At the end of the datafile, you need an end line. Add this. Save this file. (Again, in windows, Notepad++ is the way to go for this.) I’ll call this file FORGEO.txt
10. Create a second zip archive (I’ll call it TOTAL.zip) containing:
a. FORGEO.txt
b. SUPP.zip
c. Note: this means that TOTAL.zip has exactly two files in it (FORGEO.txt and SUPP.zip).
11. Using the validation option, upload ONLY FORGEO.txt to see if it validates. This is important! It will save you a lot of time to do this. You will get an error about a missing supplementary file, but don’t worry about that.
12. Using direct submission, submit TOTAL.zip using the SOFT option. This will take a long time to load, generally. You will get a screen asking if FORGEO.txt or SUPP.zip is the datafile. Choose FORGEO.txt.
13. You are done with one submission!
14. I suggest that you actually use more informative names than FORGEO.txt and SUPP.zip and TOTAL.zip. I actually name the files with the array number. Like 85012.txt, 85012_supp.zip and 85012_total.zip.
15. IMPORTANT: if you have a lot of files or just big files, the FTP option is best.
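
If you have many arrays to submit, steps 7, 9 and 10 of the protocol can be scripted. Below is a rough Python sketch of one way to do that, not something GEO provides. The function names are mine, the file names are the placeholders used in the steps, and the default end line ("!sample_table_end") is my assumption about what the sample you downloaded uses; check the end of that SOFT file and pass whatever it actually has.

```python
import os
import zipfile


def make_zip(zip_path, file_paths):
    # Write the given files into a new zip archive, stored by base name.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in file_paths:
            archive.write(path, arcname=os.path.basename(path))


def build_soft_file(header_path, data_path, out_path,
                    end_line="!sample_table_end"):
    # Step 9: edited header, then the data table, then the end line.
    # end_line is an assumption; copy it from the sample you downloaded.
    with open(out_path, "w") as out:
        for part in (header_path, data_path):
            with open(part) as handle:
                text = handle.read()
            # Make sure each part ends with a newline before appending.
            out.write(text if text.endswith("\n") else text + "\n")
        out.write(end_line + "\n")


def package_submission(raw_files, header_path, data_path, workdir="."):
    supp = os.path.join(workdir, "SUPP.zip")
    forgeo = os.path.join(workdir, "FORGEO.txt")
    total = os.path.join(workdir, "TOTAL.zip")
    make_zip(supp, raw_files)                        # step 7
    build_soft_file(header_path, data_path, forgeo)  # step 9
    make_zip(total, [forgeo, supp])                  # step 10
    return total
```

You would still validate FORGEO.txt by hand (step 11) before uploading TOTAL.zip; the script only saves the repetitive file shuffling.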

TAMALPAIS: howto open files

key words: TAMALPAIS, NimbleGen, Mark Bieda, ChIP, server

Background:
TAMALPAIS is the webserver that I created to analyze NimbleGen ChIP-chip data (note that it is not for promoter data). You can find it at:

http://chipanalysis.genomecenter.ucdavis.edu/cgi-bin/tamalpais.cgi

I’ve received queries from a number of people who have trouble opening the files from my TAMALPAIS server. Here are instructions:

Mac:
1. on the mac (modern macs with OSX, not ancient macs), this should be easy – just click on the file

Windows:
(one option: transfer the files to a Mac (see above). If you don’t want to do this (I wouldn’t), then continue)
1. download the FREE 7ZIP program from www.7-zip.org
2. install 7ZIP
3. right-click on the file from TAMALPAIS and select 7ZIP from the menu, select “Open archive”
4. click on the files that show up in the archive window. At any point, you can click on the “extract” button in the toolbar in the window (it is the large “minus sign” that is blue/purple).
5. for any of the files ending with .tar.gz, or ending with .tar, or ending with .zip, you can continue to do this procedure (starting with step #3).

There are a bunch of files in subarchives (that is, in other .tar.gz files within the archive).
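
If you have python installed, the nested archives can also be unpacked in one go on any platform. This is only a sketch under a few assumptions of mine: the function name and the "_contents" folder naming are made up, and it assumes the archives end in .tar.gz, .tar or .zip as described above. It relies on shutil.unpack_archive from the standard library, which picks the format from the file extension.

```python
import os
import shutil

ARCHIVE_SUFFIXES = (".tar.gz", ".tar", ".zip")


def unpack_nested(archive_path, dest_dir):
    # Unpack the archive, then look for archives inside it and unpack
    # each of those into a sibling folder named <archive>_contents.
    shutil.unpack_archive(archive_path, dest_dir)
    for root, _dirs, files in os.walk(dest_dir):
        for name in files:
            if name.endswith(ARCHIVE_SUFFIXES):
                inner = os.path.join(root, name)
                unpack_nested(inner, inner + "_contents")
```

Call it with the file you downloaded and an empty output folder, for example unpack_nested("results.tar.gz", "results") (the file name here is hypothetical). Only do this with archives from a source you trust, since unpacking an archive writes whatever paths it contains.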

Problems?

If you have problems, contact me using the contact information on the About page of this blog.