Featured

I’m back to blogging in 2017

So… I’ve been busy. Lots of stuff including some published work (see http://www.markbieda.com).

But I’m back to blogging So much fun stuff to discuss: docker, R, NGS…

Advertisements

R tip #3: str() is the most important function in R

Several years ago, I read somewhere that str() could be called the most valuable or useful function in R.

I thought “yeah, right, I’ve been programming in R for more that 10 years – this is ridiculous”.

Now, when I teach R… I teach str() immediately.

Why?
str() gives a great summary of information on an object.
99+% of time (except in code itself), str() is preferable to class(). str() will help a bioinformatician get quick information on an object.

Example:

data(mtcars)
class(mtcars)
## [1] "data.frame"
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

I tell students to let str() guide their subsetting of dataframes and of S4 classes.

R tip #2: Learn dataframe subsetting in R in 5 minutes

To subset a dataframe: to apply conditions to extract a subset of the data that is in the dataframe.

Example: a company has employees listed in a dataframe and the employees work in either Texas or California. You may want to subset the dataframe to get only employees in Texas.

There are several ways to subset dataframes in R. There is a well established and very flexible system in base R. The “tidyverse” (specifically dyplyr) offers another approach, which is also quite flexible.
Although the tidyverse approach continues to grow in popularity, most online examples of R code use traditional subsetting.
Traditional subsetting of dataframes in R can seem bizarre. Every year, I find that students struggle with this.

In this post, I’ll talk about the logic of subsetting in a few steps.

LOAD DATA

data(mtcars)

IMPORTANT: for mtcars, the car name is actually the row name. So the car name is not a column! (But you can always get car names by executing rownames(mtcars))

Let’s look at the rownames and the columnnames

rownames(mtcars)
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"
colnames(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

0. Remember: R numbering starts with 1, not zero!

1. Fundamental syntax of subsetting

We subset mtcars by setting a row condition and a column condition.

mtcars[row_condition,column_condition]

row_condition: an expression telling us how to filter the rows. For example, we may want rows 5,8, and 12.
column_condition: an expression telling us how to filter the columns. For example, we may want only the columns 1 and 4.

2. Simple subsetting by number

I want data from row 7.
The row_condition is 7.
The column_condition is everything, which is represented just with no character or a space.

#this option - preferred formatting with space
mtcars[7, ]
##             mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Duster 360 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4
#this gives same output (no space used in statement)
mtcars[7,]
##             mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Duster 360 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4

Second example:
To get just column 7,

mtcars[ ,7]
##  [1] 16.46 17.02 18.61 19.44 17.02 20.22 15.84 20.00 22.90 18.30 18.90
## [12] 17.40 17.60 18.00 17.98 17.82 17.42 19.47 18.52 19.90 20.01 16.87
## [23] 17.30 15.41 17.05 18.90 16.70 16.90 14.50 15.50 14.60 18.60

3. More advanced subsetting by multiple numbers

I want data from rows 7 and 12.
I want columns 3 and 8.
To get rows 7 and 12, we set the row_condition to the vector c(7,12).
To get columns 3 and 8, we set the column_condition to the vector c(3,8).

mtcars[c(7,12),c(3,8)]
##             disp vs
## Duster 360 360.0  0
## Merc 450SE 275.8  0

4. Simple true and false subsetting

If the value is TRUE we keep the row.
Let’s make a simple subset of the mtcars dataframe to demonstrate this just using our numeric subsetting; call it mtsmall.

#make a small dataframe with first 3 rows and first three columns; this does not use true/false approach
mtsmall <- mtcars[c(1,2,3),c(1,2,3,4)]
mtsmall
##                mpg cyl disp  hp
## Mazda RX4     21.0   6  160 110
## Mazda RX4 Wag 21.0   6  160 110
## Datsun 710    22.8   4  108  93

What if we want just rows 1 and 3?
We can use c(TRUE, FALSE, TRUE) for the row_condition.

#only rows 1 and 3 and ALL columns (from reduced dataframe)
mtsmall[c(TRUE,FALSE, TRUE), ]
##             mpg cyl disp  hp
## Mazda RX4  21.0   6  160 110
## Datsun 710 22.8   4  108  93

5. Conditional subsetting

Let’s go back to mtcars, the full dataframe.
What if we want to get all information for cars with mpg>20?
row_condition seems like it would be “mpg>20”. But this won’t work – we need to supply a true or false for each row.
column_condition is everything. Everything leads to a space or nothing.

BUT: the row condition will seem strange.
Note that “mtcars$mpg>20” leads to a series of true and false. For every true, we keep that row. For a false, we eliminate the row.

mtcars$mpg>20
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [23] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE

row_condition is “mtcars$mpg>20”
column_condition is everything. Everything is indicated by a space or nothing.

#all columns for cars with mpg>20
mtcars[mtcars$mpg>20, ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

6. Excluding rows or columns

For dataframes, -n means exclude row (or column) n.
This is quite useful when there are some observations that are faulty.

mtcars[-5, ] return the dataframe with all rows except row 5 and all columns

#all data except row 5
mtcars[-5, ]
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

The same sort of logic holds for excluding columns.

#all columns except 5; all rows
mtcars[ ,-5]
##                      mpg cyl  disp  hp    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 2.780 18.60  1  1    4    2

R tip #1: use paste0 instead of paste

You expect that concatenating two strings will yield the strings “mashed together”. But paste is annoying (or convenient) in that it adds a separator by default (space). paste0 doesn’t do this, so it acts more like conventional concatenation which is simply more convenient, usually. So I advocate for the oddly named paste0 to be your default paste.

An example:

paste("var","First")

Output:
[1] "var First"

paste0("var","First")
Output:
[1] "varFirst"
 

 

Why use Docker for bioinformatics?

I’ve been using Docker a lot recently for my bioinformatics work. I’ve really fallen in love with it as a way of quickly getting things going.

Here’s why!

You want to install and use a bioinformatics software package… Now!

First, you have to find the package to begin with – which usually means navigating someone’s webpage or GitHub page and figuring out what to do and how to install. It’s not really fun.
Second, you can end up in dependency nightmare land, where you just go from updating one thing to installing something else..
Bottom Line: This just takes time and (mental) energy.

With Docker: You find the Docker image, which is very fast in practice. You install the image – a simple standard one-line procedure… and then you use the package!

You want to use a Linux package under Windows or on a Mac

Without Docker: you have to have a virtual machine or somehow install using the new bash for windows systems (hint: don’t do this second option).
With Docker: You just “run the image” (actually, create the container), so you can easily use Linux-only software under windows.
Also, although Mac OSX is so so similar to Linux in the internals… it’s not Linux. So you can definitely run into problems.
Bottom Line: Docker makes this very straightforward. No thinking.

You want to have multiple versions of a software package available for usage.

Yes, this sounds weird but in practice, I find that I often wish that I had the newer or older version. I was recently using GATK and found that I wanted to have GATK3 and GATK4 available. With Docker, I just have separate images for each. (Sure, there are easy ways around this but Docker just makes it simple to have the multiple versions organized). Again, in practice the mental energy counts.
Bottom Line: Docker makes this simple and keeps the separate versions separate and clear. The easy way.

My Experience

I have a dual-boot machine. I also have virtual machines (both VMware and VirtualBox) for Linux. I also have “Bash on Ubuntu on Windows”. I have used all of these. I’ve been dual-boot for >10 years and I’ve used virtual machine approaches for 10 years. I just get tired of playing with system configuration/memory/etc issues when I want to do my real work.
And I really like that I download the image and just start using software. So I am a convert to the Docker world for now, at least.
Bottom Line: I’m a fan!

Docker for Bioinformatics: An enormous set of images (3007! at last count)

keywords: docker, bioinformatics, software

I’m a huge fan of docker (like most everyone, it seems). My lab has been working on some docker images and pipelines including our custom code (not released yet). I’ve been using a lot of docker images to do quick analysis – I’ll write more on this in another post.

I just ran across an enormous set of “dockerized” bioinformatics software – look at

https://biocontainers.pro/registry/#/

As of today, there are 3007 images and they seem to encompass a lot of popular packages – like samtools.

I really like the documentation of the images and dockerfiles on this site – very easy to see what is actually in the image.

One issue: some packages are frequently updated – and the updates are important but the images are a bit behind. So be careful with version issues. The bcftools image is at least one version behind, for example.

Always, comments welcome.

2009 post: Key Bioinformatics Computer Skills

Note: this was written in 2009 so… out of date somewhat!

I’ve been asked several times about which computer skills are critical for bioinformatics. Important – note that I am just addressing the “computer skills” side of things here. This is my list for being a functional, comfortable bioinformatician.

  1. SQL and knowledge of databases. I always recommend that people start with MySQL, because it is crossplatform, very popular, and extremely well developed.
  2. Perl or Python. Python wins now! (2017 update!)  Preferably perl. It kills me to write this, because I like python so much more than perl, but from a “getting the most useful skills” perspective, I think you have to choose perl.
  3. basic Linux. Actually, being at a semi-sys admin level is even better. I always tell people to go “cold turkey” and just install Linux on their computer and commit to using it exclusively for a while. (Due to OpenOffice etc, this should be mostly doable these days). This will force a person to get comfortable. Learning to use a Mac from the command line is an ok second option, as is Solaris etc. Still, I’d have to say Linux would be preferred.
  4. basic bash shell scripting. There are still too many cases where this ends up being “just the thing to do”. And of course, this all applies to Mac.
  5. Some experience with Java or other “traditional languages” or a real understanding of  modern programming paradigms. This may seem lame or vague. But it is important to understand how traditional programming languages approach problems. At minimum, this ensures some exposure to concepts like object-oriented programming, functional programming, libraries, etc. I know that one can get all of this with python and, yes, even perl – but I fear that some many bioinformatics people get away without knowing these things to their detriment.
  6. R + Bioconductor. So many great packages in Bioconductor. Comfort with R can solve a lot of problems quickly. R is only growing; if I could buy stock in R, I would!

This may seem like a lot, but many of these items fit together very well. For example, one could go “cold turkey” and just use Linux and commit to doing bioinformatics by using a combination of R, perl and shell scripting, and an SQL-based database (MySQL). It is very common in bioinformatics to link these pieces, so… not so bad, in the end, I think.

As always, comments welcome…