Blast+ results file parsing to fasta file
3
3
Entering edit mode
9.3 years ago
ahmedakhokhar ▴ 150

Dear all,

I'm new to this forum and new to computational analysis, I'm using standalone NCBI Blast+ (blastp) for the first time, and I have the results file in the following format:

Query= Y
Length=6
Subject= X
Length=739


 Score = 15.4 bits (28),  Expect = 0.044, Method: Composition-based stats.
 Identities = 5/6 (83%), Positives = 6/6 (100%), Gaps = 0/6 (0%)

Query  1    DDDIPF  6
            D+DIPF
Sbjct  244  DNDIPF  250

But I want to do a multiple alignments of all the hits, for that, I need to extract the sequences in the following fasts format:

>Subject= X
Sbjct  244  DNDIPF

Is there any tool which is helpful to do direct multiple alignments from blast results files or a tool/tutorial to extract the sequences in fasta format to process further. Thanks.

sequence python blast ncbi • 13k views
ADD COMMENT
9
Entering edit mode
9.3 years ago
Siva ★ 1.9k

Run blastp with the following outfmt option

-outfmt '6 qseqid sseqid sseq'
  • qseqid means Query Seq-id
  • sseqid means Subject Seq-id
  • sseq means Aligned part of subject sequence

Then use awk to write the hit sequence in FASTA format

awk 'BEGIN { OFS = "\n" } { print ">"$2, $3 }' <YOUR_BLAST_RESULT_FILE>

Are you sure you want to align the HSPs not the full sequences of BLAST hits? What is your goal?

ADD COMMENT
0
Entering edit mode

Dear Siva, my final goal is to take the "subject" sequences from the blastp report (results) and do multiple alignment to the see their evolution patterns.

I'm trying to do blast+ against my own database and wanted to take these sequences only.

ADD REPLY
2
Entering edit mode

If you only take the aligned portion of the subject sequences from the BLAST result, you might end up aligning different segments of the same sequence against one another. To avoid that you also need to take in to account the aligning regions of the query sequence.

The simpler way would be to align the full length sequences of the BLAST hits. You can get the list of all Subject sequence IDs by using any of the following options for -outfmt

  • sallgi means All subject GIs
  • sallseqid means All subject Seq-id(s), separated by a ';'
  • sallacc means All subject accessions

Once you have the list of sequence IDs or GIs, you can use blastdbcmd tool to extract the full length sequences from your BLASTdb, provided it was created with the option -parse_seqids. Then, you can continue with multiple alignment and other evolutionary analyses.

If you don't know already, you will find answers for most of your BLAST related questions here: BLAST Command Line Applications User Manual.

If you still want to go with getting multiple alignment from BLAST results, there is a tool called Blammer from HHsuite that converts BLAST/PSI-BLAST result in to multiple alignments. Though I have never used it.

ADD REPLY
0
Entering edit mode

Thanks Siva, this worked exactly the way I anticipated but { print ">"$2, $3 } failed to write the data into the file, ( $2, $3 ) remain empty, any ideas?

ADD REPLY
1
Entering edit mode

As Ram suggested, please post the full BLAST and awk commands you used. If you have multiple HSPs from one sequence (different regions of the same sequence), using the awk command I posted above will result in multiple sequences with same FASTA headers. Multiple sequence alignment programs expect unique sequence identifiers.

ADD REPLY
2
Entering edit mode

Dear Siva & Ram,

I have used the following commends:

# to get blastp result fle:
blastp -query file.txt -subject file.fasta -outfmt '6 qseqid sseqid sseq evalue' -out file_align.txt

# to write the data.file in fasta :
awk 'BEGIN { OFS = "\n"} {print ">" $1 $2, $3}' file_align.txt > file2.txt

Thanks for the help, Still, I have one more question, I want to restrict my blastp results to evalue<0.001, and I tried to do in with -outfmt (-outfmt '6 qseqid sseqid sseq evalue<0.001), but this filter doesn't seem to work, any idea?

ADD REPLY
1
Entering edit mode

For the awk part, try specifying the separator explicitly using awk -F "\t" 'BEGIN ...

In the second part, where you're trying to filter, you're adding the filter to the formatting option, which is incorrect. You might wanna add the evalue filter option with -evalue 0.00099 to the blastp command

ADD REPLY
0
Entering edit mode

Thank you RamRS, This is what I say looking for. Now my sequences are ready for MSA.

ADD REPLY
0
Entering edit mode

Could you maybe give you the exact full command you used? That might help!

ADD REPLY
1
Entering edit mode
9.3 years ago

You could parse blastp's tabular output using your favorite scripting language and output data in fasta format or use XSLT to convert blastp's xml output to fasta (to get started, have a look at Pierre Lindenbaum's stylesheet here).

ADD COMMENT
0
Entering edit mode
9.3 years ago
Bara'a ▴ 270

@ahmedakhokhar : I think the first portion of your question has already been answered by colleagues , but I can help you though if you still have problems retrieving the hit sequences.

Regarding the second portion of your question, I can recommend you Clustal Omega if you are dealing with extremely large sequence files, which outperforms ClustalW2 in efficiency and memory requirements. It's a quite easy to use Global Multiple Alignment tool with useful features such as the calculation of pairwise distance matrix that represents the evolutionary distance between each pair of sequences. You can refer to the official site for more information : http://www.clustal.org/

Please let me know if you need a hand in your script, I will be happy to help.

ADD COMMENT
0
Entering edit mode
Sorry for the untidy answer , I don't know what's wrong with my browser :(
ADD REPLY
1
Entering edit mode

No worries, I cleaned it up for you :)

ADD REPLY
0
Entering edit mode
Well thank you so much for that @RamRS :D
ADD REPLY

Login before adding your answer.

Traffic: 2438 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6