UNIX, BLAST, and long-running jobs

Continuing from Running BLASTs on UNIX:

OK, you’ve set up BLAST and seen roughly how it works. Now let’s do a BLAST of all mouse RefSeq proteins against all zebrafish RefSeq proteins:

%% blastall -i mouse.protein.faa -d zebrafish.protein.faa -o results.txt -p blastp

...and wait.

This will take a loooong time. (Take a look at your watch and write down the start time, while you’re at it ;)

You could get up and take a coffee break, but if you’re impatient, like me, you don’t want to get up every time a computer takes a few minutes to do something. For one thing, you want to know how long it’s going to take. For another thing, maybe you want to do more on that computer!

There are a couple of things you can do at this point to check on the status and estimate how long it’s going to take to do the BLAST.

1. Open up another window and log into the computer that way. Now type ‘top’. This will show you what’s currently running on the computer.

2. Or, hit CTRL-Z and then type ‘bg’. This will put the BLAST into the background, telling to run while you’re doing other things at the command line. You can look to see what’s running in the background on this shell with ‘jobs’, or type ‘fg’ to go back to it (and then CTRL-Z/bg again to background out of it). In this case, there’s no reason to return to it – it will just exit when it’s done running.

OK, now look at the output file. Try typing ‘less results.txt’. What do you see? That’s right, a bunch of BLAST output! We’ll show you how to work with it in a bit, but for now, we just want to see how many BLAST records there are. The simplest way to do that is to search for something that’s unique to each record. There are at least two things that are unique to each BLAST record. One is the string ‘Query=’, and the other is ‘BLAST’ starting at the beginning of a line.

In both situations, you’ll want to search for lines containing that record, and then count them. To do the first, you can do

%% grep ^BLAST results.txt

Here the carat is part of something called a “regular expression” which lets you specify mismatches or wildcards in strings to find – ^ means “find the text BLAST immediately after the beginning of the line.”

This will spit out a lot of results; hit CTRL-C to stop it.

OK, now, we want to count them. You can do this by sending (piping) the results of the grep to another program, ‘wc’, which counts lines.

%% grep ^BLAST results.txt | wc

This will report three numbers: the number of lines, the number of words, and the number of characters. You’re interested in the number of lines, which is the first number.

This will tell you how many BLAST comparisons have been done. But we still need to figure out how many total there are to do! Again, we can use grep to do this:

%% grep ^'>' mouse.protein.faa | wc

This will tell you how many FASTA header lines there are in the file. (There should be 36377, or 36,000 proteins.)

OK, now, if you look at the current time, and how many have been done, and how many are left to be done, what do you estimate is the amount of time it will take to complete the BLAST?

That’s right. A while.

Yep, time for a cup of coffee! Still, isn’t this easier than doing 40,000 BLASTs one by one through the Web interface?

Additional: download a nt database and do appropriate BLASTs.

Advanced topics: using screen. shell scripts.

Parsing

OK, so now we’ve got these great big honking BLAST files – do an

%% ls -l results.txt

Pretty big file, huh? You sure don’t want to look through that line by line, and you couldn’t load it into Word if you tried. A script, to the rescue!

First, get the script:

%% hg clone http://bitbucket.org/ctb/ngs-course

This will create a new directory ‘ngs-course’ which contains a bunch of useful files. We’ll go through a bunch of them.

For now, you just want to use the parse-blast-to-csv.py script:

%% python ngs-course/blast/parse-blast-to-csv.py results.txt > out.csv

Run this, wait a few seconds, and then hit ‘CTRL-C’ to stop the process. Now type ‘less out.csv’. What is this file?

It’s a “comma-separated value” file ...

What can we do next? For now, try loading it into Excel and looking at it; you can do this by copying it to your local computer, either with WinSCP (on Windows) or with ‘scp’ on the Terminal command line on Mac OS X. To do the latter, start up a new Terminal window or Tab, and type):

%% scp -i $privatekey root@$amazon:out.csv ~/Desktop/

replacing $privatekey with the location of your private key (what you’re using with ssh, too), and $amazon with the name of your Amazon EC2 host. This will take the file ‘out.csv’ on your Amazon host and copy it to your desktop.


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.
comments powered by Disqus