Blast Database Size Influence On Number Of Significant Hits
1
0
Entering edit mode
10.5 years ago
PoGibas 5.1k

I have a set of gene sequences and specific sequence.

cat genes.fa
>Gene_1_chr1_1000_1200  
ACGT...
>Gene_2_chr2_3000_3400
TTAT...  

cat sequence.fa
>Searchable_sequence
ACGG...

I want to search for this specific sequence in 1. gene sequences; 2. gene flanking sequences.

Gene flanking sequence = Gene coordinates +/- gene size -- Gene flanking sequence is gene locus plus/minus gene locus size (flanking sequence fasta file is two times bigger than original gene sequences fasta).

cat gene_flanks.fa
>Gene_1_Flank5_chr1_800_1000  
CAGT...
>Gene_1_Flank3_chr1_1200_1400  
AAGT...
>Gene_2_Flank5_chr2_2600_3000
TTAT...  
>Gene_2_Flank3_chr2_3400_3800
ACAT...  

Gene database size: 2 sequences - 600 nucleotides
Flanks database size: 4 sequences - 1200 nucleotides

I use BLAST for search:

blastn  -task blastn -db genes -query $Sequence -outfmt 6 -out - | wc -l   
6
blastn  -task blastn -db flanks -query $Sequence -outfmt 6 -out - | wc -l   
2

Number of hits between original gene set and flanks set differ. My questions are:
1. Am going to use number of hits for enrichment analysis - how accurate is it to compare number of hits between databases that have different size? (evalue depends of database size and I might be getting bias because of smaller/bigger database size).
2. I want to filter BLAST hits using evalue - can I use same evalue for databases that have different size?

blast • 4.5k views
ADD COMMENT
1
Entering edit mode
10.5 years ago
lelle ▴ 830

You might want to have a look at the math behind blast bitscores and e-values: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

In general I think what you see makes sens, because in a longer sequence the possibility to get a random hit is higher. So the same hit is not as significant in the bigger database. If you do not want to follow this reasoning you should probably work with the bit scores instead of the e-values.

ADD COMMENT
0
Entering edit mode

or switch db and query ;-)

ADD REPLY
1
Entering edit mode

Query length should also be taken into acount for e-values if I am not mistaken. You can not outsmart Blast I'm afraid.

ADD REPLY
1
Entering edit mode

Query sequence length is taken into account indeed, but not the number of sequences included in query file. So the trick still works :-)

ADD REPLY
0
Entering edit mode

That is true and also the reason why you should not forget to do multiple testing correction on your blast results... unless you are deliberately trying to ignore this issue.

ADD REPLY
0
Entering edit mode

What about merging both fasta files into one database -> single search -> splitting BLAST output for every different data-set?

ADD REPLY

Login before adding your answer.

Traffic: 2433 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6