How to generate multiple sequence alignments from BLAST(stand alone BLAST) hits output result?
1
1
Entering edit mode
5.1 years ago
Kumar ▴ 120

I have to generate multiple sequence alignment from the BLAST hit amino acid sequences. I did tBLASTn and extracted BLAST hit sequences by using the following commands,

tblastn -query xac.fasta -db t_db -outfmt '6 sseqid qseqid qseq evalue' -out msa.xls
awk 'BEGIN { OFS = "\n"} {print ">" $1 $2, $3}' msa.xls > msa1.fasta

But, the output file msa.xls printed special characters (------- *) in between (BLAST hits) amino acid sequences. If it DNA sequence, I could consider * as N, and go ahead of multiple sequence alignment. But in terms of amino acid sequences, N meant for Asparagine (single letter amino acid code). Therefore, how could I carryout multiple sequence alignment for these ( ---) special characters interrupted amino acid sequences. To get rid of this, I have printed only BLAST hit coordinates of the concerned query in BLAST output by using the following command,

tblastn -query xac.fasta -db t_db -outfmt 6 -max_target_seqs 1 -out type2.xls

But, here I have to extract coordinate based BLAST hits. Samtools can be done for single coordinate extraction, but I have to extract the sequences for more than 50 - 60 coordinates. Please help me to solve the same. Thank you in advance.

Dear herve, Please find few lines of msa.xls here,

PDRGLVMVG-ADMATFDMYLLRSEV-VLLCTGRPEDITLDGWFRLEP*RQQRYATRAELIASTQEVPHAARP*SH
BLAST Phylogeny Amino acid sequence MSA • 6.3k views
ADD COMMENT
0
Entering edit mode

Please use 'code sample' fields while pasting pieces of code to make your post more readable.

ADD REPLY
0
Entering edit mode

Could you had some lines of msa.xls in your post please

ADD REPLY
0
Entering edit mode

You do understand qseq is for aligned part of query sequence. The hit would be sseq, i.e. aligned part of the subject sequence.

So anyway if that's what you want, you can do like this to remove "-" and "*"

.. -outfmt '6 sseqid qseqid qseq' | awk 'BEGIN{FS="\t";OFS="\n"}{gsub("-","",$3);gsub("*","",$3);print ">"$1 $2,$3}'
ADD REPLY
0
Entering edit mode

If i remove those - and * special characters from the amino acid sequences, it may create negative effects on my further analysis. My intention is to know, what is meant for * special character, Since I know - meant for gaps. Removing those special characters only is the solution, then its not the right one to do. Could you please suggest me any tool or script to extract the multiple BLAST hit coordinates from the query contigs.

ADD REPLY
0
Entering edit mode

You don't need a script, just check from blastn -help which fields you want in your output, I'm guessing it's some of these:

qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
ADD REPLY
0
Entering edit mode

Dear 5heikki, You have specified BLAST output headers. But, I need to extract the BLAST hit query sequences, based on qstart and qend in the BLAST output, I could use the samtools commands as given below,

samtools faidx xap119.fasta CAGJ01000008.1:500-5059

But, in samtools, only one set of coordinates can be extracted(qstart to qend). My intention is to extract the query sequences based on the BLAST hit output coordinates (qstart to qend) of multiple BLAST results. Because I have been doing BLAST analysis for single query against multiple subject databases, for example, virulence db, enzymes db etc. Please help me to do the same.

ADD REPLY
0
Entering edit mode
$cat seq.fa 
>seq1
aaaaattttt
>seq2
gggggaaaaa
>seq3
cccccttttt

$cat seq.txt
seq1:1-5
seq2:5-10
seq3:2-4

$xargs samtools faidx seq.fa <seq.txt 
[fai_load] build FASTA index.
>seq1:1-5
aaaaa
>seq2:5-10
gaaaaa
>seq3:2-4
ccc
ADD REPLY
0
Entering edit mode

Thank you so much @5heikki. It works perfectly and this is what I exactly need to do. Once again thank you.

ADD REPLY
0
Entering edit mode

Dear 5heikki, I have to specify few parameters/cut-off in the BLAST command line such as,

query/subject length is > 40%
query/subject sequence identity is > 70%

Is it possible to add the same in the BLAST command line.

ADD REPLY
0
Entering edit mode

Check from blastn -help the fields you need and pipe the blast output into awk and use it for filtering awk 'BEGIN{OFS=FS="\t"}{if($x>y && $z>w){print $a, $b, $c}}' where the letters correspond to field/column numbers and your desired cutoff values..

ADD REPLY
0
Entering edit mode
5.1 years ago
gb ★ 2.2k

I want to clarify something.

With blast you can not do a multiple sequence alignment, with blast you do a Pairwise Sequence Alignment.

Tools for a multiple sequence alignment are for example:

  • MUSCLE
  • MAFFT
  • CLUSTAL

So I think the answer to "How to generate multiple sequence alignments from BLAST" is you can't

ADD COMMENT
0
Entering edit mode

Dear gb, Thank you for your help. I have been searching homologues sequences of my target gene families in my concerned genome sequences (more than 30 genomes to be searched). I need to search the homologues sequences with a score of > 70% identity and > 40% query length to the target gene families. From the BLAST results, I intended to extract the query sequences (BLAST hit query sequences against the target gene famillies) and then planned to go for multiple sequence alignment followed by gene tree generation. The gene tree to be compared with species tree as specified in various research articles.

Each query genome BLAST hit sequences to be extracted and then put into single fasta file and subjected for MAFFT alignment.

After referring numerous research articles, I adopted this methodology. Please correct me, if i am wrong.

ADD REPLY

Login before adding your answer.

Traffic: 2753 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6