Installing and Running NCBI BLAST

You should start this tutorial at a prompt that looks something like this:

root@ip-10-82-233-6:~#

Type ‘cd’ to go to your home directory on your EC2 machine.

Now, use your Web browser on your laptop to go to:

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/

Then right- or control-click on the file ending with ‘x64-linux.tar.gz’ and choose “copy link URL”. This is the file for 64-bit (large) Linux machines, which is what our EC2 instance is. (The current URL is: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi-blast-2.2.25+-x64-linux.tar.gz)

Now use the ‘curl’ program to download it to your Amazon computer:

%% curl -O ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi-blast-2.2.25+-x64-linux.tar.gz

Here, ‘curl’ is a program that takes a Web link and downloads it via the command line; in this case, it’s grabbing that file and saving it into your current directory.
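
If you ever need to do the same thing from a script, here is a rough Python 3 sketch of what ‘curl -O’ does, using only the standard library. The function name ‘download’ is made up for this example; it is an illustration, not part of the tutorial’s toolchain.

```python
import os
import urllib.parse
import urllib.request


def download(url, dest_dir="."):
    """Save the file at 'url' into dest_dir under its remote name, like 'curl -O'."""
    # Take the last path component of the URL as the local file name.
    remote_path = urllib.parse.urlparse(url).path
    file_name = os.path.basename(urllib.parse.unquote(remote_path))
    local_path = os.path.join(dest_dir, file_name)
    # urlretrieve fetches the URL and writes the bytes to local_path.
    urllib.request.urlretrieve(url, local_path)
    return local_path
```

Calling download() with the BLAST URL above should leave the .tar.gz file in the current directory, just as curl does.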

After it completes, you should see the file in your local directory:

%% ls ncbi-*.tar.gz

This is a .tar.gz file, which is kind of like a zip file. You need to use the ‘tar’ program to unpack it (you could use ‘unzip’ if it were a .zip file):

%% tar xzf ncbi-*.tar.gz

This will create a new subdirectory, ‘ncbi-blast-2.2.25+’:

%% ls
Dropbox  ncbi-blast-2.2.25+  ncbi-blast-2.2.25+-x64-linux.tar.gz
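
For comparison, the unpacking step can also be sketched in Python 3 with the standard ‘tarfile’ module. This is a minimal illustration under the assumption that the archive is gzip-compressed; the function name ‘unpack_tarball’ is invented for this example.

```python
import tarfile


def unpack_tarball(archive_path, dest_dir="."):
    """Unpack a gzip-compressed tar archive into dest_dir, like 'tar xzf'."""
    # "r:gz" opens the archive for reading with gzip decompression.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()  # every file and directory in the archive
        archive.extractall(dest_dir)
    return names
```

The returned list of names is roughly what ‘tar tzf’ would print for the same archive.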

If you look in the blast subdirectory, you will see a few more files and directories:

%% ls ncbi-blast-2.2.25+
bin  ChangeLog  doc  LICENSE  ncbi_package_info  README

In this case, we want to put everything in that bin/ directory into a common place where UNIX knows to look for programs to run. One such place (that, by convention, is a good place to install things that don’t come with the computer) is /usr/local/bin:

%% cp ncbi-blast-2.2.25+/bin/* /usr/local/bin
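
To check that the copy worked, you can ask which file would actually be run for a given program name. Here is a small Python 3 sketch using ‘shutil.which’, which searches the same list of directories (the PATH) that the shell does. The function name ‘find_programs’ is just a name chosen for this example.

```python
import shutil


def find_programs(names):
    """Map each program name to the full path found on the PATH, or None."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_programs(["blastp", "makeblastdb"]).items():
        print(name, "->", path or "not found on PATH")
```

If the install went as described, both programs should be reported under /usr/local/bin.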

Now, let’s move to a different part of the file system.

%% cd /mnt

This goes to the folder named ‘/mnt’, which is on another (bigger) disk. We’ll explain this more tomorrow.

Now let’s grab some biggish files to work with: the mouse and zebrafish reference proteomes!

Go to ftp://ftp.ncbi.nlm.nih.gov/refseq/ in your browser and explore a bit. You’ll see there’s a bunch of files and directories; in this case, we want to go grab the mouse and zebrafish protein sets. So, grab ftp://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.protein.faa.gz and ftp://ftp.ncbi.nlm.nih.gov/refseq/D_rerio/mRNA_Prot/zebrafish.protein.faa.gz:

%% curl -O ftp://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.protein.faa.gz
%% curl -O ftp://ftp.ncbi.nlm.nih.gov/refseq/D_rerio/mRNA_Prot/zebrafish.protein.faa.gz

These files aren’t .tar.gz files; they’re just .gz files. The .faa extension means “FASTA amino acid”, that is, protein sequences in FASTA format. ‘gz’ (gzip) is a compression scheme for single files; to get at the contents, uncompress both of them with this command:

%% gunzip *.gz

If you use ‘ls’, you’ll see that the files have turned into ‘mouse.protein.faa’ and ‘zebrafish.protein.faa’:

%% ls
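
If you are curious what ‘gunzip’ is doing, the same step can be sketched in Python 3 with the standard ‘gzip’ module. This is only an illustration; ‘gunzip_file’ is a made-up name, and it mimics gunzip’s default behavior of removing the .gz file afterwards.

```python
import gzip
import os
import shutil


def gunzip_file(gz_path):
    """Decompress 'name.gz' to 'name' and remove the original, like gunzip."""
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a file name ending in .gz")
    out_path = gz_path[:-3]  # strip the trailing ".gz"
    # Stream the decompressed bytes so large files never sit in memory at once.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(gz_path)
    return out_path
```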

You can also take a look at the contents of the files with the ‘more’ program, which pages through the files.

%% more mouse.protein.faa

Use the spacebar to scroll down, and ‘q’ to exit before reaching the end of the file. You can also look at the zebrafish file:

%% more zebrafish.protein.faa

Now, let’s convert them into BLAST databases:

%% makeblastdb -in mouse.protein.faa -dbtype prot
%% makeblastdb -in zebrafish.protein.faa -dbtype prot

This lets us use BLAST to query the databases for matches.

Before we do a big BLAST, let’s start by doing a small one, just to check that it’s all working. To do that, we’ll skim off some sequences from the top of the file:

%% head zebrafish.protein.faa

The problem here is that ‘head’ by default only selects the first 10 lines of a file, which may not be a complete set of FASTA records – so you may have to tweak things. In this case, the first 14 lines are complete:

%% head -14 zebrafish.protein.faa

Let’s take the output of ‘head’ and put it in a file, ‘zebrafish.top’, that we can use for other purposes:

%% head -14 zebrafish.protein.faa > zebrafish.top
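
Counting lines by hand gets tedious, because FASTA records vary in length. A more robust approach is to count records rather than lines. Here is a minimal Python 3 sketch; it assumes every record starts with a header line beginning with ‘>’, and the function name ‘first_fasta_records’ is invented for this example.

```python
def first_fasta_records(fasta_path, n_records):
    """Return the text of the first n_records complete FASTA records."""
    kept = []
    seen = 0  # number of '>' header lines read so far
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                seen += 1
                if seen > n_records:
                    break  # the next record has started, so stop here
            if seen > 0:
                kept.append(line)  # header or sequence line of a kept record
    return "".join(kept)
```

For example, writing the result of first_fasta_records("zebrafish.protein.faa", 2) to a file gives you the first two complete records, however many lines they span.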

OK, great! Now let’s run a BLASTP comparing these zebrafish sequences to the mouse proteins, and we’ll put the results in a file ‘xxx.txt’:

%% blastp -query zebrafish.top -db mouse.protein.faa -out xxx.txt

(The file name ‘xxx.txt’ is just a throwaway file name, something that we can look at and see is a test file. You can use your own convention; I usually go with something short and recognizably silly, like ‘xxx’, ‘yyy’, ‘foo’, etc.)

OK, now take a look at that file with ‘more’:

%% more xxx.txt

Yep, looks like BLAST output to me!

There’s all sorts of things you can do to alter the BLAST output; run ‘blastp -help’ to get a list of those options. For example, ‘-evalue 1e-6’ will set the e-value cutoff at 1e-6; hits with an e-value above that will not be displayed.

Now let’s run a bigger BLAST, all zebrafish proteins against all mouse proteins:

%% blastp -query zebrafish.protein.faa -db mouse.protein.faa -out zebrafish.x.mouse &

This is going to take a while, which is why we told the computer to give us back a command prompt while blastp runs (that’s what the & does).

So, how long is it going to take? We can guesstimate by looking at how many sequences have been processed since we started. To do that, run

%% grep Query= zebrafish.x.mouse

OK, that gives us all the query lines – now what? Let’s count them with ‘wc -l’:

%% grep Query= zebrafish.x.mouse | wc -l

Here, | is what’s known as a ‘pipe’, telling the command line to take the output of ‘grep’ and send it to the command ‘wc’, which counts lines, words, and characters. The ‘-l’ tells wc to count the lines only.

Compare that number to the number of sequences in the zebrafish protein database:

%% grep ^'>' zebrafish.protein.faa | more

to see the FASTA headers, and

%% grep ^'>' zebrafish.protein.faa | wc -l

to count all the sequences.
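
Putting the two counts together gives a rough progress estimate. The Python 3 sketch below does the same counting as the grep commands and then extrapolates. It assumes each query takes about the same time, which is only approximately true, and that ‘Query=’ appears at the start of a line in the output. All function names are made up for this example.

```python
def count_matching_lines(path, prefix):
    """Count the lines in a file that start with the given prefix."""
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(prefix):
                count += 1
    return count


def blast_progress(output_path, query_fasta_path):
    """Return (queries done, total queries, fraction done)."""
    done = count_matching_lines(output_path, "Query=")
    total = count_matching_lines(query_fasta_path, ">")
    fraction = done / total if total else 0.0
    return done, total, fraction


def estimate_remaining_seconds(elapsed_seconds, done, total):
    """Extrapolate the remaining run time; returns None if nothing is done yet."""
    if done == 0:
        return None
    # Time per finished query, times the number of queries still to go.
    return elapsed_seconds * (total - done) / done
```

For example, if 1,000 of 10,000 queries have finished after 10 minutes, this estimates about 90 more minutes.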

Last, but not least – let’s run a quick script to convert the file into a set of CSV matches:

%% python ~/Dropbox/ngs-scripts/blast/blast-to-csv.py zebrafish.x.mouse > ~/Dropbox/zebrafish-mouse.csv

Take a look at the script and see if you can understand what it does:

%% more ~/Dropbox/ngs-scripts/blast/blast-to-csv.py

Before you leave for lunch:

Let’s start a second BLAST, all of mouse against all of zebrafish:

%% blastp -query mouse.protein.faa -db zebrafish.protein.faa -out mouse.x.zebrafish &

...now the computer can work while we eat!

When we come back, we can work through a reciprocal BLAST example.


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.