Question

blastp command line not reporting hits

0

Entering edit mode

4.8 years ago

egerrick • 0

Hi,

I've been running blastp using a large number of queries on a cluster, but am having an unexpected issue of very obvious hits not being reported in the output. Essentially I did RNA-Seq and de novo transcriptome assembly on an organism of interest (with no sequenced genome) straight out of a mouse, and am trying to use BLAST to remove mouse genes and annotate the remaining genes. So I have a list of protein sequences from this (~40,000 proteins) of which many are mouse contamination, and others are from my organism of interest. To determine which are mouse, I created a blast database from all the mouse proteins on uniprot (~100,000 proteins), and then ran the following command on a cluster:

blastp -query my_sequences.fasta -num_threads 16 -db uniprot-mouse -max_target_seqs 1 -outfmt "6 std stitle" -evalue 1e-30 -word_size 7 > output.txt

I used word size of 7 and a low e-value since I expect contaminating mouse proteins to align with almost perfect identity.

My problem is that the output is missing a lot of hits. I've found many query proteins that align with near-perfect identity to mouse proteins that were simply not listed in my blast output. When I blast them on the webserver they are identified with evalues well below my 1e-30 threshold, and I've double checked my mouse protein database and these proteins are in there. Does anyone have any idea why this might be happening?

For extra information, I'm running this job on a cluster with the following parameters:

--time=48:00:00
--cpus-per-task=16
--mem=32000
--mail-type = FAIL

And am using blast+/2.7.1

And I am not getting a failure email, so it doesn't look like the run is simply timing out and thus truncating my output.

Thank you very much, any insights would be extremely appreciated!

software error blast • 1.4k views

ADD COMMENT • link updated 4.8 years ago by Ram 43k • written 4.8 years ago by egerrick • 0

1

Entering edit mode

Couple of things you could try.

Use blat to search your sequences against mouse proteome. As long as your assembly worked reasonably well, you should be able to identify those proteins and then remove them.
You could use bbmap.sh from BBMap suite to align to align your original reads against the mouse genome and then collect reads that do not map. Then repeat the transcriptome assembly on just those reads. Those should be things you are interested in.

I would recommend #2.

ADD REPLY • link 4.8 years ago by GenoMax 142k

0

Entering edit mode

Not sure if this will solve your issue, but these two papers may help you understand your issue a bit more.

Paper1 Paper2

ADD REPLY • link 4.8 years ago by Gio12 ▴ 220

0

Entering edit mode

Thank you very much for your reply. My understanding from the issue addressed in those papers is that the "max_target_seqs" parameter can, in certain situations, sometimes choose hits to output in an arbitrary manner (i.e. based on their order in the database). But my issue is that there is simply no hit in the output at all, despite an almost perfect match being present in the database.

ADD REPLY • link 4.8 years ago by egerrick • 0

0

Entering edit mode

If you are missing a lot of hits, but getting some, then is just a sensitivity problem.

Try to understand what parameters are used on the webserver and use exactly those. Otherwise, you can just slightly relax your evalue.

ADD REPLY • link 4.8 years ago by Fabio Marroni ★ 3.0k