A tutorial in basic digital normalization

Date:March 25, 2012


These documents are not maintained and their instructions may be out of date. However the GED Lab does maintain the khmer protocols which may cover similar topics. See also the installation instructions for the current version of the khmer project.

See http://arxiv.org/abs/1203.4802 for more information on digital normalization.

First, start up an EC2 instance using ami-999d49f0 running on m2.xlarge (17.1 GB of RAM). See (for example) Starting your cloud system for basic information on doing so & on connecting in to it.

Now, ssh in.

Installing khmer

You’ll need to install both screed and khmer:

% git clone git://github.com/ged-lab/screed.git
% git clone git://github.com/ged-lab/khmer.git

% cd screed
% python setup.py install
% cd ..

% cd khmer
% make test
% cd ..

% export PYTHONPATH=/root/khmer/python

Getting test data

Next let’s get the tutorial data. Basically, just make an EC2 volume from snap-9d295ee7, attach as sdf, and mount as /data:

% mkdir /data
% mount /dev/xvdf /data


% ls /data

should show

README.txt  ecoli_ref.fq.gz  lost+found  schizo-50m.fq.gz

Three-pass diginorm for microbial genomic DNA

We’ll start by working with a small E. coli data set from the Velvet-SC paper (Chitsaz et al., 2011)

For genomic DNA, we will do a three-pass approach (see http://arxiv.org/abs/1203.4802). First, do a first round of digital normalization to C=20:

% cd /mnt
% /root/khmer/scripts/normalize-by-median.py -C 20 -k 20 -N 4 -x 2.5e8 --savehash ecoli_ref.kh /data/ecoli_ref.fq.gz

(wait a while...) ...this should eliminate about 2/3 of the data.

Next, trim low-abundance k-mers:

% /root/khmer/scripts/filter-abund.py ecoli_ref.kh ecoli_ref.fq.gz.keep

(wait a while...) ...this should get rid of another 25% or so.

Finally, do the second round of normalization to C=5

% /root/khmer/scripts/normalize-by-median.py -C 5 -k 20 -N 4 -x 1e8 ecoli_ref.fq.gz.keep.abundfilt

(wait a while...) ...this will get rid of another 75%, leaving under 400,000 sequences of the original 5m.

And voila – the sequences to assemble are in ‘ecoli_ref.fq.gz.keep.abundfilt.keep’, in FASTA format. See Short Read Assembly with Velvet.

If you want to split into paired and orphan reads, do:

% python /root/khmer/sandbox/strip-and-split-for-assembly.py ecoli_ref.fq.gz.keep.abundfilt.keep

This will output .pe and .se files that you can feed into Velvet as -fasta -shortPaired and -fasta -short, respectively. You will want to try out a bunch of different k parameters for the assembly, note.

Assembling with Velvet

First, grab and build velvet:

% curl -O http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.03.tgz
% tar xzf velvet_1.2.03.tgz
% cd velvet_1.2.03/
% cp velvet? /usr/local/bin
% cd ../

Now, run it for a bunch of k values to figure out which one is optimal:

% for i in {19..51..2}; do
    velveth ecoli.kak.$i $i -fasta -short ecoli*.se -shortPaired ecoli*.pe;
    velvetg ecoli.kak.$i -exp_cov auto -cov_cutoff auto -scaffolding no;

This should take about an hour to complete for the 16 different assemblies. After it’s done, you can look at the results; khmer comes with a script that we use for this purpose:

% /root/khmer/sandbox/assemstats3.py 1000 ecoli.kak.*/contigs.fa
** cutoff: 1000
N       sum     max     filename
285     4493169 96567   ecoli.kak.19/contigs.fa
249     4503372 110013  ecoli.kak.21/contigs.fa
232     4508404 109999  ecoli.kak.23/contigs.fa

Here, the first number is the number of contigs greater than the cutoff (1000); the second number is the sum of bases in contigs greater than the cutoff; and the third number is the maximum contig size.

And voila, you have your assemblies!

Single-pass diginorm for mRNAseq

For the yeast mRNAseq, we’ll do a single-pass approach on the 100m read data set from the Trinity paper (Grabherr et al., 2011.) Again, see http://arxiv.org/abs/1203.4802 for discussion.


% /root/khmer/scripts/normalize-by-median.py -C 20 -k 20 -N 4 -x 2e9 /data/schizo-50m.fq.gz

This will get rid of about 90% of the data in the 100m read data set. It will probably take an hour or two...

Now, you’ll be left with the data in schizo-50m.fq.gz.keep.

For Oases, you will want to split these into paired and orphan reads files:

% python /root/khmer/sandbox/strip-and-split-for-assembly.py schizo-50m.fq.gz.keep

which will produce .pe and .se files. Note that you will want to try a range of k values for the diginorm files.

For Trinity, you may want to split these into /1 and /2 ended files:

% python /root/khmer/sandbox/split-pe.py schizo-50m.fq.gz.keep

which will produce .1 and .2 FASTA files that you can then feed into Trinity.

Installing and running Oases

Make sure you’ve downloaded and installed Velvet, as above. Now build and install Oases:

% curl -O http://www.ebi.ac.uk/~zerbino/oases/oases_0.2.06.tgz
% tar xzf oases_0.2.06.tgz
% cd oases_0.2.06
% make VELVET_DIR=../velvet_1.2.03
% cp oases /usr/local/bin
% cd ..

Now, run Velvet and Oases:

% velveth yeast.oases.21 21 -fasta -short schizo*.se -shortPaired schizo*.pe
% velvetg yeast.oases.21 -read_trkg yes
% oases yeast.oases.21

(this will take about 30 minutes)

... and voila, you have your assembly in ‘yeast.oases.21/transcripts.fa’. To get some basic stats, do:

% python /root/khmer/sandbox/assemstats3.py 300 yeast.oases.21/transcripts.fa
** cutoff: 300
N       sum     max     filename
9969    29946663        18764   yeast.oases.21/transcripts.fa

Again, the first number is the number of contigs greater than the cutoff (1000); the second number is the sum of bases in contigs greater than the cutoff; and the third number is the maximum contig size.

Installing and running Trinity

Download and install Trinity:

% cd /mnt
% curl -L 'http://downloads.sourceforge.net/project/trinityrnaseq/trinityrnaseq_r2011-11-26.tgz?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Ftrinityrnaseq%2F&ts=1325692775&use_mirror=cdnetworks-us-2' > trinityrnaseq_r2011-11-26.tgz
% tar xzf trinityrnaseq_r2011-11-26.tgz
% cd trinityrnaseq_r2011-11-26
% make

Now, run it – but make sure to set your stack size to unlimited!

% ulimit -s unlimited

% mkdir yeast
% cd yeast
% ln -fs ../../*.[12] .
% ../Trinity.pl --seqType fa --left schizo*.1 --right schizo*.2 --SS_lib_type RF

(this will take about three hours)

Your assembly will be in trinity_out_dir/Trinity.fa.

Now check out the assembly:

% python /root/khmer/sandbox/assemstats3.py 300 trinity_out_dir/Trinity.fasta
** cutoff: 300
N       sum     max     filename
12085   16689812        8516    trinity_out_dir/Trinity.fasta

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.
comments powered by Disqus