Running BLASTs on UNIX

Introduction

OK, your first introduction to the use and abuse of command line tools is... BLAST! That’s right, the Basic Local Alignment Search Tool!

I’m going to assume that all of you have used the NCBI BLAST Web page to do individual searches. Today we’ll show you how to go through and do batch searches at the command line on your own computer. This is a technique that works well for small-to-medium sized searches, for example, one transcriptome against another. The various database (nr, nt) are getting big enough that it’s reasonably time consuming to search them on your own, although of course you can do it if you want – you might just have to wait a while for things to finish.

Before I forget, let me say that there are a lot of tips and tricks for working at the UNIX command line that I’m going to show you, so even if you’ve used command line BLAST before, you should skim along.

First, install BLAST.

Installing BLAST

To install the BLAST software, you need to download it from NCBI, unpack it, and copy it into standard locations:

%% curl -O ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.24/blast-2.2.24-ia32-linux.tar.gz
%% tar xzf blast-2.2.24-ia32-linux.tar.gz
%% cp blast-2.2.24/bin/* /usr/local/bin
%% cp -r blast-2.2.24/data /usr/local/blast-data

Download the databases

Now, download the databases. Let’s start by doing a BLAST of mouse against zebrafish, to detect putative homology. For this you’ll need the mouse refseq protein db, here, and the zebrafish refseq protein db, here. Use curl to retrieve them:

%% curl -O ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.protein.faa.gz
%% curl -O ftp://ftp.ncbi.nih.gov/refseq/D_rerio/mRNA_Prot/zebrafish.protein.faa.gz

This downloads the files into the current working directory from the given FTP site, naming the files for the last part of the path (e.g. ‘mouse.protein.faa.gz’). You can do this from any Web or FTP address.

Now you’ve got these files. How big are they?

%% ls -l *.faa.gz

About 9.6 mb for the mouse protein set, and 8.8 mb for the zebrafish protein set.

Are these files you can work with? Well, no – they’re actually compressed, in the gzip format. Do

%% gunzip *.faa.gz

to uncompress them.

Type

%% less mouse.protein.faa

to display the beginning of the mouse file in a pager (something that will let you scroll down through the file). Type ‘q’ to exit from the pager before the end of the file.

OK, we’ve got the databases, but BLAST requires that each subject database be preformatted for use; this is a way of speeding up certain types of searches. To do this, do

%% formatdb -i zebrafish.protein.faa -o T -p T
%% formatdb -i mouse.protein.faa -o T -p T

The -i parameter gives the name of the database, the -o parameter says “save the results”, and the -p parameter says “this is a protein database”. For DNA, you’d want to use ‘-p F’, or false.

We couldn’t use * here as a way to specify both files because formatdb’s ‘-i’ option takes only one filename, and *.faa expands into two files (see globbing, @CTB).

Before we do a BLAST of all mouse protein against all zebrafish protein, though, let’s make sure our blast is working! To do this, we want to start with something small. Let’s take a few sequences off the top of the mouse protein set:

%% head mouse.protein.faa > mm-first.fa

Here, the program ‘head’ takes the first ten lines from that file, and the ‘>’ tells UNIX to put them into another file, ‘mm-first.fa’. A couple notes here:

  • UNIX generally doesn’t care what the file is called, so ‘.faa’ and ‘.fa’ are all the same to it.
  • UNIX utilities work well with text files, and almost everything you’ll encounter is a basic text file. This is different from Windows and Mac, where more complicated formats are used that can’t be as easily dealt with on UNIX.
  • here’s a useful command-line tip, by the way: you can use Tab to do filename completion. For example, if you type ‘head mou<Tab>’ it will fill out the word after ‘mou’ by looking in the current directory for files starting with ‘mou’ and then filling in after it. In bash (the UNIX shell that you’re all using) if there are multiple files that start with mou, it won’t fill in but will rather beep – hit Tab again to see what the options are.

Now try a BLAST:

%% blastall -i mm-first.fa -d zebrafish.protein.faa -p blastp

...and whoops, it’s done! And it probably scrolled off your screen...

You can do three things at this point. First, you can scroll up in your ssh window to look at the output. Second, you can save the output to a file:

%% blastall -i mm-first.fa -d zebrafish.protein.faa -p blastp -o out.txt

and then use ‘less’ to look at it:

%% less out.txt

and third, you can pipe it directly to less, by which I mean you can send all of the output to the ‘less’ command without saving it to a file:

%% blastall -i mm-first.fa -d zebrafish.protein.faa -p blastp | less

If you do any or all of these, you’ll see that you’ve done a BLAST! It’s got the typical output, it’s got a bunch of matches, etc. And if you scroll down enough, you can see that you’ve actually done TWO BLASTs – because there are two sequences in mm-first.fa.

Additional things to try

Different BLAST programs

Try downloading some different types of sequences and doing BLAST between them. You can specify the program with -p, as above; remember,

  • blastn: DNA x DNA
  • blastp: protein x protein
  • blastx: DNA x protein
  • tblastn: DNA vs DNA, translated
  • tblastx: protein x DNA

and also remember to specify the correct sequence type when running formatdb on the database:

%% formatdb -i $filename -o T -p F

for DNA, and

%% formatdb -i $filename -o T -p T

for protein (the ‘-p’ option is True for protein, False for DNA databases).

Different BLAST options

BLAST has lots and lots and lots of options. Run ‘blastall’ by itself to see what they are. Some of the most useful ones are ‘-v’, ‘-b’, and ‘-e’.


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.
comments powered by Disqus