Installing and Running NCBI BLAST

You should start this tutorial at a prompt that looks something like this:


Type ‘cd’ to go to your home directory on your EC2 machine.

Now, use your Web browser on your laptop to go to:

right- or control-click on the file ending with ‘x64-linux.tar.gz’, and “copy link URL”. This is the file for 64-bit (large) Linux machines, which is what our EC2 instance is. (The current URL is:

Now use the ‘curl’ program to download it to your Amazon computer:

%% curl -O

Here, ‘curl’ is a program that takes a Web link and downloads it via the command line; in this case, it’s grabbing that file and saving it into your current directory.

After it completes, you should see the file in your local directory:

%% ls ncbi-*.tar.gz

This is a .tar.gz file, which is kind of like a zip file. You need to use the ‘tar’ program to unpack it (you could use ‘unzip’ if it were a .zip file):

%% tar xzf ncbi-*.tar.gz

This will create a new subdirectory, ‘ncbi-blast-2.2.25+’:

%% ls
Dropbox  ncbi-blast-2.2.25+  ncbi-blast-2.2.25+-x64-linux.tar.gz

If you look in the blast subdirectory, you will see a few more files, most of which are directories:

%% ls ncbi-blast-2.2.25+
bin  ChangeLog  doc  LICENSE  ncbi_package_info  README

In this case, we want to put everything in that bin/ directory into a common place where UNIX knows to look for programs to run. One such place (that, by convention, is a good place to install things that don’t come with the computer) is /usr/local/bin:

%% cp ncbi-blast-2.2.25+/bin/* /usr/local/bin

Now, let’s go to a new section of the machine.

%% cd /mnt

This goes to the folder named ‘/mnt’, which is on another (bigger) disk. We’ll explain this more tomorrow.

Now lets grab some biggish files to work with... the mouse and zebrafish reference proteomes!

Go to in your browser and explore a bit. You’ll see there’s a bunch of files and directories; in this case, we want to go grab the mouse and zebrafish protein sets. So, grab and

%% curl -O
%% curl -O

These files aren’t .tar.gz files, they’re just .gz files – the .faa means “fasta”. ‘gz’ is a compression scheme for single files; to get at the contents, do uncompress both of them with this command:

%% gunzip *.gz

If you use ‘ls’, you’ll see that the files have turned into ‘mouse.protein.faa’ and ‘zebrafish.protein.faa’:

%% ls

You can also take a look at the contents of the files with the ‘more’ program, which pages through the files.

%% more mouse.protein.faa

Use the spacebar to scroll down, and ‘q’ to exit before reaching the end of the file. You can also look at the zebrafish file:

%% more zebrafish.protein.faa

Now, let’s convert them into BLAST databases:

%% makeblastdb -in mouse.protein.faa -dbtype prot
%% makeblastdb -in zebrafish.protein.faa -dbtype prot

This lets us use BLAST to query the databases for matches.

Before we do a big BLAST, let’s start by doing a small one, just to check that it’s all working. To do that, we’ll skim off some sequences from the top of the file:

%% head zebrafish.protein.faa

The problem here is that ‘head’ by default only selects the first 10 lines of a file, which may not be a complete set of FASTA records – so you may have to tweak things. In this case, the first 14 lines are complete:

%% head -14 zebrafish.protein.faa

Let’s take the output of ‘head’ and put it in a file, ‘’, that we can use for other purposes:

%% head -14 zebrafish.protein.faa >

OK, great! Now let’s run a BLASTP comparing these zebrafish sequences to the mouse proteins, and we’ll put the results in a file ‘xxx.txt’:

%% blastp -query -db mouse.protein.faa -out xxx.txt

(The file name ‘xxx.txt’ is just a throwaway file name, something that we can look at and see is a test file. You can use your own convention; I usually go with something short and recognizably silly, like ‘xxx’, ‘yyy’, ‘foo’, etc.)

OK, now take a look at that file with ‘more’:

%% more xxx.txt

Yep, looks like BLAST output to me!

There’s all sorts of things you can do to alter the BLAST output; run ‘blastp’ to get a list of those options. For example, ‘-evalue 1e-6’ will set the e-value cutoff at 1e-6, above which nothing will be displayed.

Now let’s run a bigger BLAST, all zebrafish proteins against all mouse proteins:

%% blastp -query zebrafish.protein.faa -db mouse.protein.faa -out zebrafish.x.mouse &

This is going to take a while, which is why we told the computer to give us back a command prompt while blastp runs (that’s what the & does).

So, how long is it going to take? We can guesstimate by looking at how many sequences have been processed since we started. To do that, run

%% grep Query= zebrafish.x.mouse

OK, that gives us all the query lines – now what? Let’s count them with ‘wc -l’:

%% grep Query= zebrafish.x.mouse | wc -l

Here, | is what’s known as a ‘pipe’, telling the command line to take the output of ‘grep’ and send it to the command ‘wc’, which counts words, lines, and paragraphs. The ‘-l’ tells wc to count the lines only.

Compare that number to the number of sequences in the zebrafish protein database:

%% grep ^'>' zebrafish.protein.faa | more

to see the FASTA headers, and

%% grep ^'>' zebrafish.protein.faa | wc -l

to count all the sequences.

Last, but not least – let’s run a quick script to convert the file into a set of CSV matches:

%% python ~/Dropbox/ngs-scripts/blast/ zebrafish.x.mouse > ~/Dropbox/zebrafish-mouse.csv

Take a look at the script and see if you can understand what it does:

%% more ~/Dropbox/ngs-scripts/blast/

Before you leave for lunch:

Let’s start a second BLAST, all of mouse against all of zebrafish:

%% blastp -query mouse.protein.faa -db zebrafish.protein.faa -out mouse.x.zebrafish & the computer can work while we eat!

When we come back, we can work through a reciprocal BLAST example.

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.
comments powered by Disqus