Slightly more advanced scripting with Python

There are two really useful data types in Python that you should know how to use, lists and dictionaries.

Lists

You already have some experience with lists: they’re what we’re stuffing data points into in Plotting with matplotlib with ‘data.append’. A list just contains an ordered list of things, such as numbers and/or strings. Note that lists can contain multiple types; a list doesn’t need to contains just numbers.

Here’s some example Python code:

>>> x = []
>>> x.append('foo')
>>> x.append(5)
>>> print x
['foo', 5]

>>> for val in x:
...   print val
foo
5

Lists are great ways to keep track of random stuff, and because they’re ordered you can break rows into multiple lists and still keep track of row correlations. For example, as long as you add things in in a consistent order, you can retrieve them in the same order:

>>> x = []
>>> y = []
>>> x.append(1)
>>> x.append(2)

>>> y.append(6)
>>> y.append(7)

>>> print x[0], y[0]
1 6
>>> print x[1], y[1]
2 7

You can use for loops to go through multiple lists using zip:

>>> for xval, yval in zip(x, y):
...   print xval, yval
1 6
2 7

Think of it as joining the lists with a zipper...

Dictionaries

Dictionaries are probably the single most useful data structure ever for bioinformatics! Basically dictionaries let you link a unique key (name, id, number, whatever) to a Python object – which can be pretty much anything, like a number, string, list, or another dictionary.

This is extremely useful for situations in which you need to correlate data from different data sets, where the data sets have a common key. If the data rows are in a different order in the different data sets, then you’ll need to use a dictionary for this purpose.

We’ve already seen one example of dictionary use in the reciprocal BLAST code, find-reciprocal.py, where we’re loading in two data sets – BLAST of zebrafish v mouse, and mouse v zebrafish –

The four essential commands are:

>>> d = {}

to create a new dictionary,

>>> key = 'foo'
>>> value = 'bar'
>>> d[key] = value

to set the value for the key ‘foo’ equal to ‘bar’,

>>> val = d[key]
>>> val
'bar'

to retrieve the value associated with the key ‘foo’ (which will be ‘bar’)

One really important thing about dictionaries is that there are a lot of Python libraries that look like dictionaries (a.k.a. “follow the mapping protocol). Pretty much any data container in Python can be made to look like a dictionary.

For example...

More FASTA stuff, with screed

Warning

These documents are not maintained and their instructions may be out of date. However the GED Lab does maintain the khmer protocols which may cover similar topics. See also the installation instructions for the current version of the khmer project.

...we can use the screed FASTA/FASTQ parser library to make FASTA and FASTQ files look like dictionaries.

Let’s start by installing screed:

%% cd ~/
%% apt-get -y install git-core
%% git clone http://github.com/acr/screed.git
%% cd screed
%% python setup.py install

Now let’s get some reasonably big FASTA file, like our mRNAseq library from lamprey work:

%% mkdir /mnt/fasta
%% cd /mnt/fasta
%% curl -O http://angus.ged.msu.edu.s3.amazonaws.com/lamprey-rnatigs.fa.gz
%% gunzip lamprey-rnatigs.fa.gz

Now index it:

%% python ~/screed/screed/fadbm.py lamprey-rnatigs.fa

(Use ‘fqdbm.py’ for FASTQ files.)

From now on, you can use ‘screed’ to access sequences from this file as if they’re in a dictionary.

One more script: RPKM adjustment

One reasonably common thing to do with mRNAseq is adjust expression levels by the size of the mRNA transcript, e.g. “reads per kilobase of mRNA”. As we discussed, you still can’t numerically compare these expression levels within a sample, but it is possible to get some idea of relative expression levels by normalizing to the transcript length. (You also still need to do quantile normalization to compare between samples, too!)

In order to do this adjustment, you need two things: the original count per molecule, and the length of the sequence. However, because the data isn’t “in order” – that is, the counts aren’t listed in the same order as the sequences are in the FASTA file – you have to be able to look up the length of any given FASTA sequence. So let’s use screed to do this!

First, update your ‘ngs-course’ directory:

%% cd ~/ngs-course
%% hg pull -u

Now, go back to the ‘fasta’ directory:

%% cd /mnt/fasta

and grab an mRNAseq ‘count’ file:

%% curl -O http://angus.ged.msu.edu.s3.amazonaws.com/lamprey-mrnaseq-a.count.gz
%% gunzip lamprey-mrnaseq-a.count.gz

Now, run the ‘rpkm.py’ script on the database and count file:

%% python ~/ngs-course/mrnaseq/rpkm.py lamprey-rnatigs.fa lamprey-mrnaseq-a.count

The key part of the rpkm script is the following code:

seqdb = screed.ScreedDB(sequence_database)

for line in open(counts_file):
    count, name = line.strip().split()  # parse lines like '1523 geneX'
    count = int(count)

    # look up the sequence in the seqdb dictionary-like database.
    sequence_length = len(seqdb[name].sequence)

where in the last line we look up the sequence record in the database, as if it were a dictionary.

Further reading

There’s a plethora of good material out there for learning how to program in Python:

Honestly, start with the Python tutorial.

Think Python: How to Think Like a Computer Scientist

Intermediate and Advanced Software Carpentry in Python has a few good student-oriented tips, although it’s out of date.


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.
comments powered by Disqus