Running the diginorm paper script pipeline

Date: May 21, 2012

Here are some brief notes on how to run the pipeline for our 2012 paper on digital normalization on an Amazon EC2 rental instance.

The instructions below will reproduce all of the figures in the paper, and will then compile the paper from scratch using the new figures.

(Note that you can also start with ami-61885608, which has all of the software below already installed.)

Starting up a machine and installing software

First, start up an EC2 instance using starcluster:

starcluster start -o -s 1 -i m2.xlarge -n ami-999d49f0 pipeline

You can also do this via the AWS console; just use ami-999d49f0, and start an instance with 16 GB or more of memory.

Make sure that port 22 (SSH) and port 80 (HTTP) are open; you’ll need the first one to log in, and the second one to connect to the ipython notebook.

Now, log in!

starcluster sshmaster pipeline

(Or just SSH in however you normally would.)
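
If you started the instance by hand, it may be worth confirming that it really has the memory mentioned above. Here is a minimal sketch, assuming a Linux instance that exposes /proc/meminfo; `mem_total_gb` is a hypothetical helper name, not part of any of the tools used here.

```shell
# Hypothetical helper: print total memory in whole GB (rounded down),
# read from a meminfo-style file. Defaults to /proc/meminfo (Linux only).
mem_total_gb() {
    meminfo="${1:-/proc/meminfo}"
    # MemTotal is reported in kB; this treats 1 GB as 1048576 kB.
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1048576 }' "$meminfo"
}
```

Run `mem_total_gb` with no arguments on the instance. The kernel reserves some memory for itself, so the number printed is usually a little below the nominal instance size.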

Once you’re logged in, you’ll need to install both ‘screed’ and ‘khmer’. In this case we’re going to use the versions tagged for the paper submission:

cd /usr/local/share

git clone git://
cd screed
git checkout 2012-paper-diginorm
python setup.py install
cd ..

git clone git://
cd khmer
git checkout 2012-paper-diginorm
make test
cd ..

echo 'export PYTHONPATH=/usr/local/share/khmer/python' >> ~/.bashrc
echo 'export PATH=$PATH:/usr/local/share/khmer/scripts' >> ~/.bashrc
echo 'export PATH=$PATH:/usr/local/share/khmer/sandbox' >> ~/.bashrc
source ~/.bashrc
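
One thing to watch: if you run the echo lines above more than once, ~/.bashrc picks up duplicate entries. A minimal sketch of a way around that, assuming a POSIX shell and grep; `append_once` is a hypothetical helper name.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present.
append_once() {
    line="$1"
    file="$2"
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex).
    # If the file does not exist yet, grep fails and the line is appended.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

For example, `append_once 'export PATH=$PATH:/usr/local/share/khmer/scripts' ~/.bashrc` can be repeated safely; the single quotes keep $PATH from being expanded until ~/.bashrc is sourced.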

OK, now that these are both built, let’s install two other things: the latest version of ipython notebook (you need 0.13dev, or later):

git clone
cd ipython
python setup.py install

pip install -U pyzmq

and bowtie:

cd /mnt

curl -L -O
unzip download
cp bowtie-0.12.7/bowtie{,-build} /usr/local/bin

Finally, upgrade the latex install with a few recommended packages:

apt-get install -y texlive-latex-recommended

OK, now all your software is installed, hurrah!
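
Before starting a multi-hour run, a quick sanity check that everything is actually on your PATH can save some pain. This is only a sketch; `missing_tools` is a hypothetical helper, and the list of commands to check is up to you.

```shell
# Hypothetical helper: print each named command that is not on PATH.
# Returns non-zero if at least one is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}
```

For the software installed above, something like `missing_tools git curl bowtie bowtie-build ipython` should print nothing. Assuming the two packages install Python modules named after themselves, `python -c 'import screed, khmer'` is a similar check for the Python side.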

Running the pipeline

First, check out the source repository and grab the (rather large) initial data sets:

git clone
cd 2012-paper-diginorm

curl -O
tar xzf pipeline-data-new.tar.gz
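
Since the data download is large, it can get cut off partway. If the tar command above complains, a minimal sketch like the following tells you whether the archive itself is damaged; `check_tarball` is a hypothetical helper that relies only on the standard gzip and tar tools.

```shell
# Hypothetical helper: succeed only if the file is an intact gzip stream
# and tar can list its contents from start to finish.
check_tarball() {
    gzip -t "$1" && tar tzf "$1" > /dev/null
}
```

Usage would be `check_tarball pipeline-data-new.tar.gz && echo OK`; if it fails, re-download the file before trying again.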

Now go into the pipeline directory and run the pipeline. This will take 4-8 hours, so you might want to do it in ‘screen’ (see Handling Long Jobs in Unix).

cd pipeline
make KHMER=/usr/local/share/khmer
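
Because the run takes hours, it helps to keep a log you can look at after reconnecting. Here is a rough sketch of a wrapper that does this; `run_logged` and the log file name are hypothetical, and it is meant to be used inside ‘screen’, not as a replacement for it.

```shell
# Hypothetical helper: run a command, append its stdout and stderr to a log
# file with start/end markers, and return the command's exit status.
run_logged() {
    log="$1"
    shift
    echo "START $(date) : $*" >> "$log"
    rc=0
    "$@" >> "$log" 2>&1 || rc=$?
    echo "END $(date) : exit status $rc" >> "$log"
    return "$rc"
}
```

For example, `run_logged make.log make KHMER=/usr/local/share/khmer` runs the same make command as above, and `tail -f make.log` from another window shows progress.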

Once it successfully completes, copy the data over to the ../data/ directory:

make copydata

Run the ipython notebook server:

cd ../notebook
ipython notebook --pylab=inline --no-browser --ip='*' --port=80 &

Connect to the ipython notebook (it will be running at ‘http://<your EC2 hostname>’); if the above command succeeded but you can’t connect, you probably forgot to open port 80 on your EC2 firewall.

Once you’re connected, select the ‘diginorm’ notebook (it should be the only one in the list) and open it. Once it’s open, go to the ‘Cell...’ menu and select ‘Run all’.

(Cool, huh?)

Now go back to the command line and execute:

mv *.pdf ../
cd ../

and voila, ‘diginorm.pdf’ will contain the paper with the figures you just created.

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0). Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.