Question

Question: Converting Blast Results/Output to FASTA format

2

Entering edit mode

6.6 years ago

apulunuj ▴ 30

Hello,

I am a graduate student with very little background in bioinformatics, and programming. Currently, I am trying to convert my blast output from a job script I ran on a cluster. The output:

Building a new DB, current time: 09/11/2017 21:50:46
New DB name:   /scratch/napulu/input_protdb/ref_prot.fasta
New DB title:  ref_prot.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /scratch/napulu/input_protdb/ref_prot.fasta
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 11 sequences in 0.00116897 seconds.

AcanGEN_gi|440795652|gb|ELR16769.1|     XP_002649088.2  VYQRDYRDFAVVQPDQLYIREILPPIALESNPITSVCSRFIHTSSNKVKYPINCVSWTPEGRRLVTGSSSGEFTLWNGLTFNFETILQAHDSAVRSIIWSHNEDWMVSGDDSGNIKYWQPNMNNVKIFKAHEQSKIRGLSFSPTDLKLASCADDKIIKIWDFARCTEDNQLVGHGWDVKCVSWHPQKSLIVSGGKDNNIKIWDAKSSQNITTLHGHKSTVSKVEWNQNGNWIVSASSDQLLKVFDIRTMKEMQTFKGHGKEVTALALHPYHEDLFVSGDKDGKILYWIVGNPEPQAEIHSAHDADVWSLSWHPIGHILASGSNDHTTKFWSRNRPADTMKDKYNNPNASKDIENEEDADDQESDLSLPGLSLPGLGSYK     77      455     379

AcanGEN_gi|440798889|gb|ELR19950.1|     XP_020438963.1  LNRIEYVHSKNFIHRDIKPDNFLIGRGKKVTLIHIIDFGLAKKYRDSRSHTHIPYKEGKNLTGTARYASINTHLGIEQSRRDDIEALGYVLMYFLRGSLPWQGLKAISKKDKYDKIMEKKISTSVEVLCRNASFEFVTYLNYCRSLRFEDRPDYTYLRRLLKDLFIREGFTYDFLFDWTCVYASEKDKKKMLENKNRFDQTADQEGRDQR      113     322     210

Initially, I made a script. That split the line based on tab space. Then I used an if statement to see lines containing amino acids one letter code. Then I split those lines with ','. I wish to extract the query_id, and alignment to make the fasta file. The query_id is supposed to be new_line[0].

Does anyone know a better way of doing this? thank you.

alignment sequence python • 4.6k views

ADD COMMENT • link updated 6.6 years ago by kashiff007 ★ 1.9k • written 6.6 years ago by apulunuj ▴ 30

3

Entering edit mode

Not exactly sure what you want to extract in terms of the ID but why not so something like: grep Acan blast_result | awk -F ' ' 'BEGIN {OFS="\n"}{print ">"$2,$3}' this will produce

>XP_002649088.2
VYQRDYRDFAVVQPDQLYIREILPPIALESNPITSVCSRFIHTSSNKVKYPINCVSWTPEGRRLVTGSSSGEFTLWNGLTFNFETILQAHDSAVRSIIWSHNEDWMVSGDDSGNIKYWQPNMNNVKIFKAHEQSKIRGLSFSPTDLKLASCADDKIIKIWDFARCTEDNQLVGHGWDVKCVSWHPQKSLIVSGGKDNNIKIWDAKSSQNITTLHGHKSTVSKVEWNQNGNWIVSASSDQLLKVFDIRTMKEMQTFKGHGKEVTALALHPYHEDLFVSGDKDGKILYWIVGNPEPQAEIHSAHDADVWSLSWHPIGHILASGSNDHTTKFWSRNRPADTMKDKYNNPNASKDIENEEDADDQESDLSLPGLSLPGLGSYK
>XP_020438963.1
LNRIEYVHSKNFIHRDIKPDNFLIGRGKKVTLIHIIDFGLAKKYRDSRSHTHIPYKEGKNLTGTARYASINTHLGIEQSRRDDIEALGYVLMYFLRGSLPWQ

ADD REPLY • link 6.6 years ago by GenoMax 141k

1

Entering edit mode

show the blast command you used.

ADD REPLY • link 6.6 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Here is my blast command.

makeblastdb -in ref_prot.fasta -dbtype prot

blastp -query transcriptome.fasta -db ref_prot.fasta -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

ADD REPLY • link 6.6 years ago by apulunuj ▴ 30

0

Entering edit mode

how did you captured the output ?

ADD REPLY • link 6.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Ok so it looks like you piped the output of both commands into the same file, or copied/paste right from the terminal. Good practice for a log, bad practice for parsing. If the blast result was a separate file, you would have a much easier time. Use -out blast.out.txt in your blast command to name the output file.

ADD REPLY • link 6.6 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

please, show us the script, show us the desired output.

ADD REPLY • link 6.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

#! usr/bin/env python   

import sys

infile =open('com_ortho_para.3248444.sched.txt')

for line in infile:

        line = line.split('     ')
        if (',') in line:
                new_line = line.split(',')
                query_id = new_line[0]
                subject_id = new_line[1]
                alignment = new_line[2]
                output.write('>' + query_id + '\n' + alignment) 
                print output

I am trying to making a fasta file in the format.

 '>' query_id
'alignment sequence'

Here is the blast output https://github.com/napulu123/Research-WGD/blob/master/com_ortho_para.3248444.sched.txt

ADD REPLY • link updated 6.6 years ago by GenoMax 141k • written 6.6 years ago by apulunuj ▴ 30

0

Entering edit mode

First, split by '\t', not ' '. Second, splitting by a ',' is probably a bad idea.

ADD REPLY • link 6.6 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

your output looks really badly folded ! what was your blast command ?

ADD REPLY • link 6.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Here it is.

makeblastdb -in ref_prot.fasta -dbtype prot

blastp -query transcriptome.fasta -db ref_prot.fasta -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

ADD REPLY • link 6.6 years ago by apulunuj ▴ 30

0

Entering edit mode

The output was captured via a job script from which I ran the blast command. Here is the original script.

#!/bin/bash

#PBS -N transcriptome_blast
#PBS -q tiny12core
#PBS -j oe
#PBS -m abe
#PBS -o com_ortho_para.$PBS_JOBID.txt
#PBS -l nodes=1:ppn=12
#PBS -l walltime=00:00:10:00

cd $PBS_O_WORKDIR

module purge

module load blast

makeblastdb -in ref_prot.fasta -dbtype prot

blastp  -query transcriptome.fasta -db ref_prot.fasta  -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

ADD REPLY • link updated 6.6 years ago by GenoMax 141k • written 6.6 years ago by apulunuj ▴ 30

0

Entering edit mode

warning:

=>> PBS: job killed: walltime 623 exceeded limit 600
PBS Job Statistics:
PBS Job ID: 3248444.sched

ADD REPLY • link 6.6 years ago by Pierre Lindenbaum 161k

score 1 · Answer 1 · 2017-09-13

Here is my blast command. blastp -query transcriptome.fasta -db ref_prot.fasta -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

use the option -out fileout.tsv to capture the output in a file. and then use @genomax's awk script.

EDIT: or pipe blastp into genomax's awk and redirect the output

blastp (...) | awk (...) > out.txt

score 1 · Answer 2 · 2017-09-13

1

Entering edit mode

6.6 years ago

kashiff007 ★ 1.9k

grep "^Acan" ur_blast.output | awk '{print">query: "$1"\tHit: "$2"\tscore: "$4":"$5":"$6"\n"$3}' > ur_result.output

ADD COMMENT • link 6.6 years ago by kashiff007 ★ 1.9k