TAMALg: is the package available?

I’ve received a lot of questions recently about TAMALg availability. Unfortunately, there is only a difficult-to-install package available right now; I sent it to someone recently and they had a terrible time getting it going.

I do describe the algorithm in the supplementary materials to the ENCODE spike-in competition paper (Johnson et al, Genome Research 2008).

I would love to have a simple package to distribute, but this kind of work gets little support in today’s granting environment; in fact, I don’t think that making algorithms widely available has ever been well supported by any US funding agency. And I doubt the situation is different here in Canada.

I may be getting another undergrad soon and would task that person with working on the package. As a new faculty member, I am simply overwhelmed with basics like getting my lab going right now.

I do hope that this situation changes, and thanks to all for your patience.

As I have noted previously, the L2L3combo predictions produced by the TAMALPAIS server (see previous posts on this or just search for “TAMALPAIS Bieda” – no quotes, though) are the same predictions as made by TAMALg. TAMALg also adds the step of estimating enrichment using a maxfour-type methodology.

So you can get good TAMALg predictions of sites just by using the webserver. I suggest going this route.

And to repeat – TAMALg is almost certainly NOT what you want for promoter arrays. The exceptions are a factor that is present in only a tiny fraction of promoters, or one of the newer designs with very long promoter regions (e.g. 10 kb promoters might be ok).

Commentary: ChIP-chip vs ChIP-seq and $$

Ok, so with the rush to ChIP-seq and all the hype (much of it deserved) around “next-generation” sequencing generally, you might think that arrays are dead as used for ChIP (i.e. ChIP-chip).

I don’t think this is going to happen, for simple cost reasons. For the near future, there will be lots of genome-scale ChIP studies and, for these, I strongly support ChIP-seq. It is a lot cheaper for better data. But I see a strong trend toward ChIP studies targeted toward specific biological questions and often questions requiring large sample numbers (e.g. epigenetic changes in cancer).

The financial math really isn’t that hard; with ChIP-seq running ~$5000 for external users and ChIP-chip running at $660 for external users (NimbleGen single arrays), it seems pretty clear that if a fair number of samples are involved, ChIP-chip is the way to go. That is, unless high-res whole genome coverage is absolutely necessary (usually not).

Furthermore, for taking chances on experiments, $660/sample is a lot more appealing on a lab budget than $5000/sample, particularly when you consider that, in the real world, even poor testing of a speculative idea is going to take 2 or 3 samples at minimum (=~$15,000 for ChIP-seq vs $1980 for ChIP-chip). A lot of labs can blow $2000; blowing $15,000 really hurts.
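To make the math concrete, here is a minimal sketch that scales the two per-sample prices quoted above to different sample counts. It assumes flat per-sample pricing with no volume discounts or setup fees, which is a simplification; the function name `experiment_cost` is just for illustration.

```python
# Per-sample prices quoted in the post (external-user rates, approximate).
CHIP_SEQ_COST = 5000   # USD per sample
CHIP_CHIP_COST = 660   # USD per sample (NimbleGen single array)


def experiment_cost(per_sample_cost, n_samples):
    """Total cost of n_samples at a flat per-sample price (assumed: no discounts)."""
    return per_sample_cost * n_samples


for n in (1, 3, 10, 30):
    seq = experiment_cost(CHIP_SEQ_COST, n)
    chip = experiment_cost(CHIP_CHIP_COST, n)
    print(f"{n:>3} samples: ChIP-seq ${seq:>7,} vs ChIP-chip ${chip:>6,} "
          f"(difference ${seq - chip:,})")
```

For 3 samples this reproduces the $15,000 vs $1980 comparison above; the gap grows linearly with the number of samples.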

Given this analysis, it seems to me that NimbleGen should really push the low end of the market – in other words, try to get the cost even lower on a per sample basis (for fewer spots). I think they are on the right track with their multiplex arrays, but development of these has been disappointingly slow, and last time I looked, the cost structure around the 4plex with 70K/quadrant really wasn’t very attractive.

I may revisit this topic another time, but that is it for now.

Viewing Large Text Files (like big GFF files) in Windows

I know, I know, many of you will say “just use Linux”. And this is true, but SignalMap from NimbleGen, which is quite convenient for viewing GFF files of ChIP-chip data, is a Windows-only product (and yes, I did try running it under WINE).

So if you try to load a 380,000 line file into Notepad (or even a much smaller file), Notepad will blow up. Even WordPad behaves badly.

The good – no, great – free Windows solution is Notepad++. Available here.
I’ve used it for a few years; it works great. It will easily load multiple 380,000 line files (like 40 MB GFF files).

Notepad++ also fulfills other requirements for me: it clearly has a large-ish user base and is constantly being updated/upgraded. So it is a robust, free product.

Other good bits:
(1) will convert from Windows to Mac to Unix line endings
(2) automatically recognizes the line-ending type (important for looking at files)
(3) very good syntax highlighting for a wide variety of programming languages
(4) tabbed files means that you can easily switch from file to file
(5) it retains memory of your open files – so they will be there each time you open it
(6) good behavior when you move/change a file that you are editing – will ask you to reload/save/etc.

Some not-so-good bits:
(1) for big files, regular expression stuff is slow

keywords: Mark Bieda, Windows, big files, viewing GFF files, NimbleGen, tutorial, howto, Notepad, problems

TAMALPAIS and promoter arrays

keywords: TAMALPAIS, NimbleGen, promoter arrays, array analysis, problems, Mark Bieda

I’ve been receiving some questions on TAMALPAIS usage for promoter arrays via email.

On the TAMALPAIS website, I say “Do not use this for promoter arrays.”

This is actually not quite true; there are a limited number of cases in which TAMALPAIS will perform well for promoter arrays. In this post, I discuss this.

When TAMALPAIS is ok for promoter arrays:
In short:
1. If your factor only binds to a tiny portion of the promoters (<5%), then TAMALPAIS will perform ok.
2. More precisely (and more importantly): if only a small fraction of the probes on the array fall within binding sites for your factor, then you are ok. So for promoter array designs with long promoters, you might have 15% of the promoters with a binding site, but still only a small fraction of the probes within binding sites. (Hopefully this makes sense.)

Why do I say “Do not use TAMALPAIS for promoter arrays”?
If you have a factor that binds to (or exists in) a lot of promoter regions – like POLII or some histone modifications – then TAMALPAIS will give you bad results. I don’t want that to happen. Right now, the study of histone mods and POLII is a big deal, so I don’t want people to be unhappy.

If not TAMALPAIS, then what?
There are a number of options. I developed maxfour to score promoters (see Krig et al. 2007 in JBC). I will be releasing an easy-to-use version of this software by fall 2008 (planned, not a promise). This is really the best option with NimbleGen’s current crop of designs, in my opinion. Someone else may have some great promoter array analysis software; I’m not aware of any right now – feel free to email me or leave comments. I don’t mean to be unfair to other bioinformaticians with this.

What about the promoter array analysis server?
Ah, yes. This does very limited analysis – see my post on it in this blog (click the promoter array category button on the sidepanel).

On the promoter array analysis server

keywords: NimbleGen, promoter arrays, Mark Bieda, server, analysis

Note: minor corrections on June 9, 2008
The promoter array server is located at this site

IMPORTANT USAGE NOTE: USE FIREFOX (Internet Explorer appears to create issues)

What does it do?

1. This does a simple list comparison using NimbleGen .tab files. In other words, it just outputs the number of entries that are the same in the lists for the top 100, top 200, etc.
2. This is a very simple application; just a convenience, really.
3. This does not do an analysis like TAMALPAIS (see the category on TAMALPAIS on this blog).

File format:
This is based on the .tab file format from NimbleGen; it has to have a dummy line to begin.
Here is a sample of an ok file format:

first dummy line
genenameholder CHR10_100017497_100020197 maxfourv02 0.2025
genenameholder CHR10_100164431_100167131 maxfourv02 0.7775
genenameholder CHR10_100196194_100198894 maxfourv02 0.6625

Notes on the format:
1. IMPORTANT: all fields are separated by tabs
2. The first field can vary and be meaningful.
3. The third field can vary and be meaningful.
4. The second field is the promoter name used for comparisons
5. The fourth field is the promoter value (numerical value) used for sorting (that is, determining order).
6. It’s ok to have more fields than the four. In other words, files of 10 fields are ok too. But the program will only look at the second and fourth fields.

Notes on usage:
1. the data can be entered unsorted
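For readers who want to see what this kind of comparison amounts to, here is a rough sketch in Python. It is not the server's code. It assumes that a higher value in the fourth field means a better rank (the post only says the value is used for sorting), and the function names `read_tab` and `top_n_overlap` are made up for illustration.

```python
import csv
import io


def read_tab(handle):
    """Return promoter names ordered by score, highest first (assumed direction).

    Skips the dummy first line, then uses only field 2 (promoter name)
    and field 4 (numeric value); any extra fields are ignored.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # discard the dummy first line
    rows = [(r[1], float(r[3])) for r in reader if len(r) >= 4]
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in rows]


def top_n_overlap(ranked_a, ranked_b, n):
    """Count promoters that appear in the top n of both ranked lists."""
    return len(set(ranked_a[:n]) & set(ranked_b[:n]))


# First file: the sample lines from this post. Second file: invented values.
file_a = io.StringIO(
    "first dummy line\n"
    "genenameholder\tCHR10_100017497_100020197\tmaxfourv02\t0.2025\n"
    "genenameholder\tCHR10_100164431_100167131\tmaxfourv02\t0.7775\n"
    "genenameholder\tCHR10_100196194_100198894\tmaxfourv02\t0.6625\n"
)
file_b = io.StringIO(
    "first dummy line\n"
    "genenameholder\tCHR10_100017497_100020197\tmaxfourv02\t0.9000\n"
    "genenameholder\tCHR10_100164431_100167131\tmaxfourv02\t0.8000\n"
    "genenameholder\tCHR10_100196194_100198894\tmaxfourv02\t0.1000\n"
)

ranked_a = read_tab(file_a)
ranked_b = read_tab(file_b)
for n in (1, 2, 3):
    print(f"top {n}: {top_n_overlap(ranked_a, ranked_b, n)} in common")
```

With real files you would pass `open("yourfile.tab", newline="")` instead of the in-memory examples and use larger cutoffs such as 100 and 200. Because the function sorts on the fourth field itself, the input does not need to be pre-sorted.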

TAMALPAIS known limitation: must be by chromosome



1. The first field of the gff file must be by chromosome; in particular, it probably needs to look like chr1, chrX, chrY, or chr20.

Further details:

I suspect (but am not sure) that anything of the form chr(anything) will work. Note that non-standard chr names do have a limitation: the optional secondary analyses, like location and gene finding, would not work.

Non-standard name examples:
For example, chr99 might be ok. Or chrMYGOODONE.

What am I talking about?

If you look at the first lines of your gff, you will see in the first column the location designation. For most gffs, this is like chr1. To see examples, go to the sample data page on the website. You will see that these files are by chr.

To look at your own gff files, it is easy to load them into a text editor in Linux; for Windows, I strongly suggest the excellent Notepad++ (do a Google search; it is completely free).
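If you would rather check a file with a script than by eye, here is a small Python sketch that tallies the first column of a GFF file and flags names that do not look like chr1, chrX, and so on. The pattern for "standard" names (chr plus a number, X, or Y, lowercase "chr") is my assumption for illustration, not a statement of what TAMALPAIS actually accepts, and the sample GFF lines are invented.

```python
import io
import re
from collections import Counter

# Assumption: "standard" means chr followed by a number, X, or Y (chr1, chr20, chrX, chrY).
STANDARD_CHR = re.compile(r"^chr(\d+|X|Y)$")


def summarize_seqnames(handle):
    """Count the values found in the first tab-separated column of a GFF file."""
    counts = Counter()
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF comment/header lines
        counts[line.split("\t", 1)[0]] += 1
    return counts


def classify(name):
    """Label a first-column value according to the rules of thumb in this post."""
    if STANDARD_CHR.match(name):
        return "standard"
    if name.startswith("chr"):
        return "non-standard chr name"
    return "not by chromosome"


# Invented example lines; a real file would be opened with open("yourfile.gff").
sample = io.StringIO(
    "chr1\tsource\tfeature\t1000\t1050\t0.52\t.\t.\tid=A\n"
    "chr1\tsource\tfeature\t1100\t1150\t0.10\t.\t.\tid=B\n"
    "chrMYGOODONE\tsource\tfeature\t200\t250\t0.33\t.\t.\tid=C\n"
    "region_7\tsource\tfeature\t10\t60\t0.80\t.\t.\tid=D\n"
)

counts = summarize_seqnames(sample)
for name, n in counts.items():
    print(f"{name}: {n} lines ({classify(name)})")
```

Anything reported as "not by chromosome" is the case to fix before submitting; "non-standard chr name" might run, but per the note above the optional secondary analyses would not work.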

TAMALg and TAMALPAIS: NimbleGen data analysis

Ok, I wanted to write about the relationship between TAMALPAIS and TAMALg.

keywords: Mark Bieda, TAMALPAIS, TAMALg, NimbleGen, ChIP-chip

A major part of my research is developing algorithms and statistical models for analysis of ChIP-chip experiments – specifically those done with NimbleGen arrays.
TAMALPAIS (available here) predicts binding sites from NimbleGen array data and also does some basic secondary analyses like localization of binding sites in reference to transcription start sites and which genes have a binding site in the proximal promoter. The website version gives a lot of output.
TAMALg (TAMALpais generalized) was recently ranked #1 in an unbiased competition between algorithms. It uses the exact same prediction approach as TAMALPAIS (technically, it uses the L2L3combo set of predictions – to get these predictions, go to the TAMALPAIS website here). Then, in a second step, it uses the maxfour approach that I developed for promoter arrays (Krig et al., 2007 in JBC) to predict the actual amount of enrichment per binding site.

So the relationship between TAMALPAIS and TAMALg is this:
TAMALPAIS produces the same high-quality peak predictions as TAMALg (and I say high quality because the competition showed this; see this paper abstract). But TAMALPAIS does not do the enrichment prediction. Remember to look at the L2L3combo set from TAMALPAIS to get the same predictions as TAMALg.

Future Stuff
I am planning on producing a downloadable version of TAMALg (probably Jython-based so that it will easily run on many platforms).

Remember! TAMALPAIS and TAMALg are not good for most promoter arrays!

If you have questions, you should contact me (see About tab on this site for contact info).