I have a set of gene sequences and specific sequence.
cat genes.fa
>Gene_1_chr1_1000_1200
ACGT...
>Gene_2_chr2_3000_3400
TTAT...
cat sequence.fa
>Searchable_sequence
ACGG...
I want to search for this specific sequence in 1. gene sequences; 2. gene flanking sequences.
Gene flanking sequence = Gene coordinates +/- gene size
-- Gene flanking sequence is gene locus plus/minus gene locus size (flanking sequence fasta file is two times bigger than original gene sequences fasta).
cat gene_flanks.fa
>Gene_1_Flank5_chr1_800_1000
CAGT...
>Gene_1_Flank3_chr1_1200_1400
AAGT...
>Gene_2_Flank5_chr2_2600_3000
TTAT...
>Gene_2_Flank3_chr2_3400_3800
ACAT...
Gene database size: 2 sequences - 600 nucleotides
Flanks database size: 4 sequences - 1200 nucleotides
I use BLAST for search:
blastn -task blastn -db genes -query $Sequence -outfmt 6 -out - | wc -l
6
blastn -task blastn -db flanks -query $Sequence -outfmt 6 -out - | wc -l
2
Number of hits between original gene set and flanks set differ. My questions are:
1. Am going to use number of hits for enrichment analysis - how accurate is it to compare number of hits between databases that have different size? (evalue depends of database size and I might be getting bias because of smaller/bigger database size).
2. I want to filter BLAST hits using evalue
- can I use same evalue
for databases that have different size?
or switch db and query ;-)
Query length should also be taken into acount for e-values if I am not mistaken. You can not outsmart Blast I'm afraid.
Query sequence length is taken into account indeed, but not the number of sequences included in query file. So the trick still works :-)
That is true and also the reason why you should not forget to do multiple testing correction on your blast results... unless you are deliberately trying to ignore this issue.
What about merging both fasta files into one database -> single search -> splitting BLAST output for every different data-set?