NCBI Minute: Improved Standalone BLAST Databases and Programs: Now with Taxonomic Information

I’m going to talk a little bit about the improved
standalone BLAST databases and emphasize the ability to get taxonomic information from
them directly. We will learn about the new versions of BLAST
and BLAST databases, and how to limit your search by taxonomy using
information built into the BLAST databases. We will learn about searching sequences
faster by accession, using accession lists, or seqid lists. And we will use a program that is part of
the blast package to retrieve sequences by taxonomy from blast databases, which is a
handy feature. The main motivation for us making the version
5 BLAST databases is that, as you probably know, NCBI is preparing for a future where we do
not have GI numbers, and BLAST needs to deal with that. That is part of the motivation for making
these databases. Other features are useful right away for people. These have much faster lookup of accessions
and identifiers, and that’s part of getting ready for the GI-less identifiers. They also include mapped taxonomy identifiers
for all sequences in the database. That is the main thing we will take advantage
of today. These are available as an alpha release on
our FTP site and there is a link to the directory that contains the executables for 2.8.0+. Those executables are required if you’re going
to work with the version 5 databases. There are examples of those available on the
FTP site if you want to try those out. So just to set the stage for what we’re going
to talk about, one of the most important things you can do is limit to the smallest database
that contains the sequences of interest. The most common way people do that and the
most biologically useful way to do that is to limit to a taxonomic group you’re interested
in. On the webpages you can enter taxonomy queries,
organism limits. You can select certain organisms for example
I’ve got green plants selected and I can remove a subset of that. For example I could get rid of the monocots,
the liliopsids. That would give me a smaller database and
help me to interpret my results. If you want to implement this on standalone,
traditionally this is managed by the GI list. As we know, those are going away, and there are other problems with generating GI
lists. One of the things we recommend to people,
but which in some cases is not doable, is to download a GI list from Entrez. You could do a search for bacteria and get
all of the bacterial proteins this way in principle. That is a lot of proteins. It is so many that even these GI lists are
difficult to download. So the built-in taxonomy gets around the problem
and helps. If you look at the standalone programs, the
restriction options are available for the search programs and some are available for
blastdbcmd as well. The seqidlist option works a lot better than it
did in previous versions of BLAST. It can be a list of accessions, and you need to convert
them to the version 5 format using blastdb_aliastool; I will show you how to do that. You can also have a negative list. And then we have all of these various
options that work with taxonomy identifiers. These are the integer IDs from the taxonomy
database. You can specify a list of them on the command
line with taxids or, for the ones that you want to leave out, negative_taxids. You can also make a file, a taxidlist or a
negative_taxidlist. blastdbcmd also lets you retrieve sequences
from the databases using the taxonomy IDs. So how do you get the taxonomy IDs? You can do that on the web, and these are much
simpler sets, with far fewer identifiers. Downloading these is easier than downloading
GI lists. For example, if I want all the green plant
taxids, I would use this query, green plants[subtree]. That gives me everything contained within
that taxon and you can download that as a taxid list. I can use complicated queries if I would like,
Booleans. I could get green plants without the liliopsida,
green plants[subtree] NOT liliopsida[subtree]. And then I can download this as a file on
my hard drive where I have BLAST installed and I can use that to limit the search. By the way, that
is about 200,000 IDs for green plants. If you have BLAST installed, you will usually have it
installed on a UNIX or UNIX-like system. People do run this on Windows, but the following options do not work on Windows
because they use a shell script. There is an included shell script with the
new BLAST alpha release that invokes the NCBI EDirect package to get the taxonomy IDs. It runs the query I just showed you on the
web. It is a very simple script. You give it a name like green plants. Or you can give it a tax ID that you can get
by searching to find out what the name is. It gives me all of the IDs contained in that
group. Those eventually get you to the point where
the sequences are mapped and you can get them for your BLAST searches or to retrieve them
from the database. As I said, this option will only be available
for Unix, Linux or Cygwin, partly because it is a shell script and partly because it
relies on EDirect. So this requires that you also install the
EDirect package. The instructions are on that link there.
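One quick way to confirm that the pieces mentioned so far are installed is to ask the shell whether each command is on the PATH. This is only a minimal sketch: the tool names listed are assumptions based on what the BLAST+ and EDirect packages normally install, and `missing_tools` is a helper name made up for illustration.

```shell
#!/bin/sh
# Print the name of every requested command that is NOT on the PATH.
# missing_tools is a hypothetical helper, not part of BLAST+ or EDirect.
missing_tools() {
    for tool in "$@"; do
        # command -v is the POSIX way to test whether a command is available
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed tool names: two from EDirect, four from the BLAST+ package.
missing=$(missing_tools esearch efetch blastp blastdbcmd blastdb_aliastool get_species_taxids.sh)

if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "All tools found."
fi
```

If anything is reported as missing, install the corresponding package or add its directory to the PATH before trying the taxonomy options.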
Basically there is a script there and you can paste that into your window. It will install EDirect for you. It only uses esearch, but you need to install
the whole package. For example, if I put the word soybean in
there as an argument to -n, it will give me back an output that tells me what the taxid
is, the rank, and so on and so forth and the scientific name. If I want to get a list of taxids, I first
need to get the parent, so I could say green plants. And then I would pass that taxid, for the larger
group, back into get_species_taxids.sh and it would give me back a list of the taxids that
are in that group. Of course I could redirect that output to
a file, and the name of that file would be my taxidlist argument. The other thing is that, with the string lookups making
things faster, it does allow for faster searches using accession lists. These are some data that Tom Madden and others
here have compiled looking at comparisons for how fast things are with the new databases. You will notice with the accession list it
is as fast or faster than searching with the GI list, which was not the case before. One thing about this: if you get a
list of accessions, you need to convert them into a binary format. There is a utility that comes with the BLAST
package called blastdb_aliastool and basically it will turn this into a version 5 seqid list
which is binary and sorted in some way. Here are some sample command lines. These are simple and we will look at some
more complex ones that will optimize the search a little more for using various kinds of things
we will talk about today. The first one here, here is a simple blastp
search. I’m going to use this argonaute protein from
fission yeast as a query for what we’re going to do today. The taxid 3847 is soybean, so I would only be searching soybean sequences. On the second one I can look at all green
plants. I can do a search against nr that contains
everything except for green plants if I use the negative taxid list. Say I want to get all the proteins for
soybean in FASTA format. I could use blastdbcmd to do that, this time
directly with that taxid, which is very useful. And then finally, if I have one of these seqidlists
like this one, where I have all the human non-model RefSeq proteins, which we will use in a few
minutes, I can do that search and restrict to a particular
set of sequences. Let’s do a few things on the command line
and see if we can get them to work in real time while we are watching them. I’m going to get the tax IDs for soybeans
and flowering plants and I may reduce the size of soybean to beans in the pea family
just to make it easier. I can limit the search to soybean or flowering
plants, once I get the taxonomy IDs. Then I can use blastdbcmd to extract the pea
family sequences from the nt database by the taxidlist. Finally we’ll do a search limiting nr to the
human non-model RefSeqs by seqid list. I’m going to load the terminal up here and
I’m going to go back and forth between the terminal and a text document and also a web
browser. A few things will be shifting around so we
can demonstrate. Let’s do the same kinds of queries we talked
about before on the web. I’m going to use a web browser for this particular
thing. Let’s say I want to know the tax ID for soybean. I can go to our taxonomy database and run
this on the web. If I retrieve that record I’ve got the tax
ID right here. 3847. So that is a simple one. That is what we call a leaf node or terminal
node tax ID and there will be sequences in the BLAST database mapped directly to that
tax ID. The other thing I might want to do is say
I want flowering plants as an example. In this case I want to know not just the tax
ID for flowering plants but everything that is contained within that. When I do that, I get over 170,000 taxids
and if I want to make a list I can save this to a file by checking that file box and then
change the format to tax ID list. It will be ready to use in BLAST. Let’s go
to the command line and do the same thing with the tools we talked about a minute ago. Let’s go ahead and get this thing up. There it is. This is a shell script in my directory here. I can just type help to see what the options
are. I can give it a name or I can give it a taxonomy
ID. So let’s give it a name to show you how that
works. Once the typing gets awkward I will go ahead
and copy and paste these things in here. So that is a quick way to get the tax ID right
there. If I want to get flowering plant tax IDs I
could do that as well or if I just want the pea family I can do that too. Let’s try that one. That way I would know what the pea family
is and then I can get that as a tax ID list if I want to. So let me go ahead and start using my scripts. We will go ahead and run this one where we’re
going to get all of the pea tax IDs. That is going to take a few seconds to run. You can see there is a list of tax IDs there
that I can use to limit BLAST search. What I want to do now, let’s go ahead and
use the first instance of tax IDs which is simply to use a tax ID on the command line. I’m going to put this in. Before I run that, let me explain a couple
of other options that are here because they are useful to know about and they give you
additional options. One of them is blastp-fast, which uses a larger
word size and some other things to make it run a little quicker. This is a multi-processor machine. I have an evalue restriction here. I have modified output format 7. A lot of people want to put the sequence title
in there. I’m also going to put the scientific name
of the organism. The output format here is 7, which is tab delimited. Let’s just make this an output file so we
don’t watch it scroll across the screen and we will see how long this takes. Actually, I should have named that soybean
tab but that is okay. And so there is my tab delimited output and
I’ve restricted to soybean. This is using argonaute protein from the fission
yeast and these are all the corresponding BLAST hits. In the second instance we will use a tax ID list
to limit to a larger set. We’re going to limit to the pea family. I could do this with whatever group I want. This will replace our peas tab with the real
peas tab. The only change in the command line is instead
of using the tax ID argument I’m using the tax ID list argument here. That wasn’t too bad. And now we’ve got other members of the pea
family; these are other beans, not just soybeans. We can scroll through that if we want to. It has all of the features that I asked for
in my output, so I have the title of the subject sequence plus the scientific name. Let’s say I want to get sequences out of the
BLAST database. This is a handy thing to be able to do. Suppose I want to get soybean sequences from
nr. We are not going to run a blast search, we’re
going to dump out sequences. It is a really simple option, just to get the soybean
sequences. This is the blastdbcmd command line and it
uses the taxids argument to give me just those sequences from soybean. I’m going to dump it out in FASTA format. That shouldn’t take too long. It is going through all of nr and getting
out the soybean sequences. Here is a dump out of the soybean sequences
from nr. Likewise you can do the same thing with all
of the pea family which I won’t bother to demonstrate, the command line is there in
the handout if you want to take a look and try it. By the way, all of this stuff is on the FTP
site with the rest of the directory. You can try that out yourself. One thing you might want to do is make a list
of non-model RefSeqs. I changed the example. I was going to do peas, but I decided to make
it simple and fast and we are just going to do human. This is a way of subsetting the database
with an accession list. I will put this one in here. You may notice this has nothing to do with
BLAST. This is a way of running an esearch. I will go ahead and run that. I will come back over here and bring it back
again. We can take a look at what that is. So this is a way of running an esearch. Of course, EDirect is installed. This is a query
that would give me a nice list of proteins for human. These are the ones that are RefSeqs, but they
are not models. They are not the XP types of RefSeqs. For human there are not that many of those. For some organisms there are lots of those. I’m going to get those back from the protein
database in the accession format so I get a list of accessions. That is running right now. That is not a tremendously large set. We’ll let it go for about two more seconds. I already have this saved. Let’s run head on that to see what is in there. That is going to be a list of these NP accession
numbers. In order to use this in blast to limit a database
search, what I have to do is convert it to a format that the BLAST program can read. I
am going to take the blastdb_aliastool and do that conversion. That will make it into this seqidlist. That
doesn’t take long at all. And the last thing I’m going to demonstrate
is to run the search limiting to those human non-model reference sequences. Again, it’s a blastp command line exactly
like we have used before; the only difference is that we have the seqidlist argument, which
we did not have before. Let’s take a look at that. This is a small set of results. They match the protein from the fission yeast
and so that is a nice BLAST output. I will go back to the PowerPoint slide. Basically what we did was talk about BLAST
dbv5 and 2.8.0+. It works faster with string lookups and it
has much easier options for taxonomic filtering of the BLAST databases and it works faster
with the accession list. There’s no need to use the GI list any longer. This last slide has links to useful material
for EDirect, BLAST and so on. If you have general questions you might want
to check our NCBI Insights blog or our Learn page and Fact sheets and our YouTube channel. You can write to me, [email protected]
or our help desk, [email protected], or blast-help. I will throw it open I think there have been
a lot of questions. [Rana Morris] So far there are three pretty
good questions that everyone might like to hear the answer on. There will be a Q&A document that we will
get all of those in the know to answer and post those on the web. The first one is, someone noticed that this happens to be an alpha
release and wanted to know when the final version might be released. [Peter Cooper] I don’t think we know that. I think the idea is we will leave it up there
for a while. How long that is going to be may depend
on how much feedback we get. We are trying to get people to try it out
and tell us what they find. I encourage you to do that and let us know
what you think and let us know what we can do to make it work better for you. [Rana] The second question, will the BLAST
API be expanded to support these new features? [Peter] So, let’s see. I think that is what I was going to say too. We have no plans to do that right now. If there is interest in doing that, let us
know and we will add that to the list of things we need to do. [Rana] The last question I currently have,
if we are currently downloading the nr database from the FTP, do we have to format it with
taxonomy information or is it already formatted and ready to go? [Peter] A couple of things. The one on the FTP directory that I pointed
you to is a different one than a normal one. The one that’s up there in the ordinary BLAST
directory where you’ve always downloaded the database is the version 4 and the taxonomy
information is not in that database. There is no way that I know of for you to
add it to that database. Right, the version 5 one, the link that I
sent you, those have taxonomy information, but the version 4 do not. There’s no way to make those into version
5. That is it for today. If there are any additional questions I will
get them written up. Thank you, everybody, for coming, and we will talk to you next time.
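The command lines pasted during the demo are not captured in this transcript. As a rough recap, the sketch below prints plausible equivalents for each step rather than running them, so it works without BLAST+, EDirect, or a database installed. The database name `nr_v5`, the query file `ago1.fasta`, the thread count, the E-value cutoff, the Entrez query, taxid 3803 for the pea family, and all output file names are assumptions for illustration. The option names are believed to match BLAST+ 2.8.0+ but should be checked against each program's `-help` output.

```shell
#!/bin/sh
# Dry run: each function only PRINTS a command line; nothing is executed.
# Assumptions, not from the talk: database name, query file, thread count,
# E-value cutoff, the Entrez query, and every output file name.
DB="nr_v5"
QUERY="ago1.fasta"
SEARCH="blastp -task blastp-fast -db $DB -query $QUERY -evalue 1e-6 -num_threads 4"
FMT="7 std stitle sscinames"   # tabular with comment lines, plus title and scientific name
ENTREZ='human[orgn] AND srcdb_refseq_known[prop]'   # assumed query for non-model (NP_) RefSeqs

# Taxonomy lookups with the shell script shipped in the BLAST+ package.
lookup_name()  { printf '%s\n' "get_species_taxids.sh -n '$1'"; }
expand_taxid() { printf '%s\n' "get_species_taxids.sh -t $1 > $2"; }

# Searches limited by taxid, by a taxid list file, or by a converted seqid list.
search_taxids()    { printf '%s\n' "$SEARCH -taxids $1 -outfmt '$FMT' -out $2"; }
search_taxidlist() { printf '%s\n' "$SEARCH -taxidlist $1 -outfmt '$FMT' -out $2"; }
search_seqidlist() { printf '%s\n' "$SEARCH -seqidlist $1 -outfmt '$FMT' -out $2"; }

# Sequence retrieval by taxid in FASTA format, with no search involved.
dump_fasta() { printf '%s\n' "blastdbcmd -db $DB -taxids $1 -outfmt %f -out $2"; }

# Accession list: fetch with EDirect, then convert to the binary version 5 format.
fetch_accessions() { printf '%s\n' "esearch -db protein -query \"$ENTREZ\" | efetch -format acc > $1"; }
convert_list()     { printf '%s\n' "blastdb_aliastool -seqid_file_in $1 -seqid_file_out $2"; }

lookup_name soybean
expand_taxid 3803 peas.txids     # 3803 assumed to be the pea family; confirm with lookup_name
search_taxids 3847 soybean.tab   # 3847 is soybean, as stated in the talk
search_taxidlist peas.txids peas.tab
dump_fasta 3847 soybean.fasta
fetch_accessions human_np.acc
convert_list human_np.acc human_np.bsl
search_seqidlist human_np.bsl human_np.tab
```

To execute a step for real, pipe the printed line to a shell, for example `search_taxids 3847 soybean.tab | sh`, once BLAST+ 2.8.0+ and a version 5 database are in place.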

Danny Hutson
