Question

Blast+ results file parsing to fasta file

3

Entering edit mode

9.3 years ago

ahmedakhokhar ▴ 150

Dear all,

I'm new to this forum and new to computational analysis, I'm using standalone NCBI Blast+ (blastp) for the first time, and I have the results file in the following format:

Query= Y
Length=6
Subject= X
Length=739


 Score = 15.4 bits (28),  Expect = 0.044, Method: Composition-based stats.
 Identities = 5/6 (83%), Positives = 6/6 (100%), Gaps = 0/6 (0%)

Query  1    DDDIPF  6
            D+DIPF
Sbjct  244  DNDIPF  250

But I want to do a multiple alignments of all the hits, for that, I need to extract the sequences in the following fasts format:

>Subject= X
Sbjct  244  DNDIPF

Is there any tool which is helpful to do direct multiple alignments from blast results files or a tool/tutorial to extract the sequences in fasta format to process further. Thanks.

sequence python blast ncbi • 13k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by ahmedakhokhar ▴ 150

Ram · Answer 1 · 2014-12-28

9

Entering edit mode

9.3 years ago

Siva ★ 1.9k

Run blastp with the following outfmt option

-outfmt '6 qseqid sseqid sseq'

qseqid means Query Seq-id
sseqid means Subject Seq-id
sseq means Aligned part of subject sequence

Then use awk to write the hit sequence in FASTA format

awk 'BEGIN { OFS = "\n" } { print ">"$2, $3 }' <YOUR_BLAST_RESULT_FILE>

Are you sure you want to align the HSPs not the full sequences of BLAST hits? What is your goal?

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Siva ★ 1.9k

0

Entering edit mode

Dear Siva, my final goal is to take the "subject" sequences from the blastp report (results) and do multiple alignment to the see their evolution patterns.

I'm trying to do blast+ against my own database and wanted to take these sequences only.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by ahmedakhokhar ▴ 150

2

Entering edit mode

If you only take the aligned portion of the subject sequences from the BLAST result, you might end up aligning different segments of the same sequence against one another. To avoid that you also need to take in to account the aligning regions of the query sequence.

The simpler way would be to align the full length sequences of the BLAST hits. You can get the list of all Subject sequence IDs by using any of the following options for -outfmt

sallgi means All subject GIs
sallseqid means All subject Seq-id(s), separated by a ';'
sallacc means All subject accessions

Once you have the list of sequence IDs or GIs, you can use blastdbcmd tool to extract the full length sequences from your BLASTdb, provided it was created with the option -parse_seqids. Then, you can continue with multiple alignment and other evolutionary analyses.

If you don't know already, you will find answers for most of your BLAST related questions here: BLAST Command Line Applications User Manual.

If you still want to go with getting multiple alignment from BLAST results, there is a tool called Blammer from HHsuite that converts BLAST/PSI-BLAST result in to multiple alignments. Though I have never used it.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Siva ★ 1.9k

0

Entering edit mode

Thanks Siva, this worked exactly the way I anticipated but { print ">"$2, $3 } failed to write the data into the file, ( $2, $3 ) remain empty, any ideas?

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by ahmedakhokhar ▴ 150

1

Entering edit mode

As Ram suggested, please post the full BLAST and awk commands you used. If you have multiple HSPs from one sequence (different regions of the same sequence), using the awk command I posted above will result in multiple sequences with same FASTA headers. Multiple sequence alignment programs expect unique sequence identifiers.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Siva ★ 1.9k

2

Entering edit mode

Dear Siva & Ram,

I have used the following commends:

# to get blastp result fle:
blastp -query file.txt -subject file.fasta -outfmt '6 qseqid sseqid sseq evalue' -out file_align.txt

# to write the data.file in fasta :
awk 'BEGIN { OFS = "\n"} {print ">" $1 $2, $3}' file_align.txt > file2.txt

Thanks for the help, Still, I have one more question, I want to restrict my blastp results to evalue<0.001, and I tried to do in with -outfmt (-outfmt '6 qseqid sseqid sseq evalue<0.001), but this filter doesn't seem to work, any idea?

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by ahmedakhokhar ▴ 150

1

Entering edit mode

For the awk part, try specifying the separator explicitly using awk -F "\t" 'BEGIN ...

In the second part, where you're trying to filter, you're adding the filter to the formatting option, which is incorrect. You might wanna add the evalue filter option with -evalue 0.00099 to the blastp command

ADD REPLY • link 2.1 years ago by Ram 43k

0

Entering edit mode

Thank you RamRS, This is what I say looking for. Now my sequences are ready for MSA.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by ahmedakhokhar ▴ 150

0

Entering edit mode

Could you maybe give you the exact full command you used? That might help!

ADD REPLY • link 2.1 years ago by Ram 43k

Ram · Answer 2 · 2014-12-28

1

Entering edit mode

9.3 years ago

Jean-Karim Heriche 27k

You could parse blastp's tabular output using your favorite scripting language and output data in fasta format or use XSLT to convert blastp's xml output to fasta (to get started, have a look at Pierre Lindenbaum's stylesheet here).

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Jean-Karim Heriche 27k

Ram · Answer 3 · 2014-12-30

0

Entering edit mode

9.3 years ago

Bara'a ▴ 270

@ahmedakhokhar : I think the first portion of your question has already been answered by colleagues , but I can help you though if you still have problems retrieving the hit sequences.

Regarding the second portion of your question, I can recommend you Clustal Omega if you are dealing with extremely large sequence files, which outperforms ClustalW2 in efficiency and memory requirements. It's a quite easy to use Global Multiple Alignment tool with useful features such as the calculation of pairwise distance matrix that represents the evolutionary distance between each pair of sequences. You can refer to the official site for more information : http://www.clustal.org/

Please let me know if you need a hand in your script, I will be happy to help.

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Bara'a ▴ 270

0

Entering edit mode

Sorry for the untidy answer , I don't know what's wrong with my browser :(

ADD REPLY • link 9.3 years ago by Bara'a ▴ 270

1

Entering edit mode

No worries, I cleaned it up for you :)

ADD REPLY • link 2.1 years ago by Ram 43k

0

Entering edit mode

Well thank you so much for that @RamRS :D

ADD REPLY • link 9.3 years ago by Bara'a ▴ 270