blastn for short sequences
2
4
Entering edit mode
9.4 years ago
User6891 ▴ 330

Hi everyone,

I was trying to use blast to map short sequences (between 15 - 30 nucleotides long) to the human reference genome

I have blast locally installed:

these are the options that I used:

blastn -dust no -db ucsc.h19.fasta -outfmt 7 -word_size 7 -evalue 1000 -perc_identity 100 -ungapped -query .. -out ...

I still get way to many matches for my input, while there is in fact only one match in the whole genome. What I see in the output is the good match as the first line, which is off course already good.

However then there are like 1000 matches following, where the percentage identity is 100, but they are never exactly the same match as my query sequence. Lets say my query sequence is 20 bp long, I only want to get the match where the alignment length is 20 and the q.start = 1 & q.end is 20. And this is only the case for the first line in my output. The rest of the lines never has an alignment length of 20. Is there a way to only output the matches where the alignment length is the same as the length of my query?

I can always filter my output afterwards, but I have a lot of sequences to blast, so I would like to get it as clean as possible from the beginning.

blast • 12k views
ADD COMMENT
3
Entering edit mode
9.4 years ago
Siva ★ 1.9k

You can try the -task option and set it to blastn-short which is optimized for sequences less than 30 nucleotides.

It uses slightly different default values for some options:

-word_size = 7 (which you used)
-reward = 1
-penalty = -3

For controlling the BLAST hits, can you try the option -qcov_hsp_perc and set it to 100?

From blastn -help

-qcov_hsp_perc <Real, 0..100>
   Percent query coverage per hsp
ADD COMMENT
0
Entering edit mode

Thanks this -qcov_hsp_perc <Real, 0..100> might do the trick, but it's only present starting from version 2.2.30 released in October 2014. So I will have to ask for an update on our servers

ADD REPLY
1
Entering edit mode
9.4 years ago
Cytosine ▴ 460

A good question that I would also like to know the answer for :)

I usually use the -max_target_seqs switch, with a low value (1-3), to reduce the number of hits and parse those out myself. Often the first hit is also the best alignment, though I double check the interesting hits for true positives.

Use with caution :)

ADD COMMENT

Login before adding your answer.

Traffic: 2878 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6