Question: Converting Blast Results/Output to FASTA format
2
2
Entering edit mode
6.6 years ago
apulunuj ▴ 30

Hello,

I am a graduate student with very little background in bioinformatics, and programming. Currently, I am trying to convert my blast output from a job script I ran on a cluster. The output:

Building a new DB, current time: 09/11/2017 21:50:46
New DB name:   /scratch/napulu/input_protdb/ref_prot.fasta
New DB title:  ref_prot.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /scratch/napulu/input_protdb/ref_prot.fasta
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 11 sequences in 0.00116897 seconds.

AcanGEN_gi|440795652|gb|ELR16769.1|     XP_002649088.2  VYQRDYRDFAVVQPDQLYIREILPPIALESNPITSVCSRFIHTSSNKVKYPINCVSWTPEGRRLVTGSSSGEFTLWNGLTFNFETILQAHDSAVRSIIWSHNEDWMVSGDDSGNIKYWQPNMNNVKIFKAHEQSKIRGLSFSPTDLKLASCADDKIIKIWDFARCTEDNQLVGHGWDVKCVSWHPQKSLIVSGGKDNNIKIWDAKSSQNITTLHGHKSTVSKVEWNQNGNWIVSASSDQLLKVFDIRTMKEMQTFKGHGKEVTALALHPYHEDLFVSGDKDGKILYWIVGNPEPQAEIHSAHDADVWSLSWHPIGHILASGSNDHTTKFWSRNRPADTMKDKYNNPNASKDIENEEDADDQESDLSLPGLSLPGLGSYK     77      455     379

AcanGEN_gi|440798889|gb|ELR19950.1|     XP_020438963.1  LNRIEYVHSKNFIHRDIKPDNFLIGRGKKVTLIHIIDFGLAKKYRDSRSHTHIPYKEGKNLTGTARYASINTHLGIEQSRRDDIEALGYVLMYFLRGSLPWQGLKAISKKDKYDKIMEKKISTSVEVLCRNASFEFVTYLNYCRSLRFEDRPDYTYLRRLLKDLFIREGFTYDFLFDWTCVYASEKDKKKMLENKNRFDQTADQEGRDQR      113     322     210

Initially, I made a script. That split the line based on tab space. Then I used an if statement to see lines containing amino acids one letter code. Then I split those lines with ','. I wish to extract the query_id, and alignment to make the fasta file. The query_id is supposed to be new_line[0].

Does anyone know a better way of doing this? thank you.

alignment sequence python • 4.6k views
ADD COMMENT
3
Entering edit mode

Not exactly sure what you want to extract in terms of the ID but why not so something like: grep Acan blast_result | awk -F ' ' 'BEGIN {OFS="\n"}{print ">"$2,$3}' this will produce

>XP_002649088.2
VYQRDYRDFAVVQPDQLYIREILPPIALESNPITSVCSRFIHTSSNKVKYPINCVSWTPEGRRLVTGSSSGEFTLWNGLTFNFETILQAHDSAVRSIIWSHNEDWMVSGDDSGNIKYWQPNMNNVKIFKAHEQSKIRGLSFSPTDLKLASCADDKIIKIWDFARCTEDNQLVGHGWDVKCVSWHPQKSLIVSGGKDNNIKIWDAKSSQNITTLHGHKSTVSKVEWNQNGNWIVSASSDQLLKVFDIRTMKEMQTFKGHGKEVTALALHPYHEDLFVSGDKDGKILYWIVGNPEPQAEIHSAHDADVWSLSWHPIGHILASGSNDHTTKFWSRNRPADTMKDKYNNPNASKDIENEEDADDQESDLSLPGLSLPGLGSYK
>XP_020438963.1
LNRIEYVHSKNFIHRDIKPDNFLIGRGKKVTLIHIIDFGLAKKYRDSRSHTHIPYKEGKNLTGTARYASINTHLGIEQSRRDDIEALGYVLMYFLRGSLPWQ
ADD REPLY
1
Entering edit mode

show the blast command you used.

ADD REPLY
0
Entering edit mode

Here is my blast command.

makeblastdb -in ref_prot.fasta -dbtype prot

blastp -query transcriptome.fasta -db ref_prot.fasta -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

ADD REPLY
0
Entering edit mode

how did you captured the output ?

ADD REPLY
0
Entering edit mode

Ok so it looks like you piped the output of both commands into the same file, or copied/paste right from the terminal. Good practice for a log, bad practice for parsing. If the blast result was a separate file, you would have a much easier time. Use -out blast.out.txt in your blast command to name the output file.

ADD REPLY
0
Entering edit mode

please, show us the script, show us the desired output.

ADD REPLY
0
Entering edit mode
#! usr/bin/env python   

import sys

infile =open('com_ortho_para.3248444.sched.txt')

for line in infile:

        line = line.split('     ')
        if (',') in line:
                new_line = line.split(',')
                query_id = new_line[0]
                subject_id = new_line[1]
                alignment = new_line[2]
                output.write('>' + query_id + '\n' + alignment) 
                print output

I am trying to making a fasta file in the format.

 '>' query_id
'alignment sequence'

Here is the blast output https://github.com/napulu123/Research-WGD/blob/master/com_ortho_para.3248444.sched.txt

ADD REPLY
0
Entering edit mode

First, split by '\t', not ' '. Second, splitting by a ',' is probably a bad idea.

ADD REPLY
0
Entering edit mode

your output looks really badly folded ! what was your blast command ?

ADD REPLY
0
Entering edit mode

Here it is.

makeblastdb -in ref_prot.fasta -dbtype prot

blastp -query transcriptome.fasta -db ref_prot.fasta -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

ADD REPLY
0
Entering edit mode

The output was captured via a job script from which I ran the blast command. Here is the original script.

#!/bin/bash

#PBS -N transcriptome_blast
#PBS -q tiny12core
#PBS -j oe
#PBS -m abe
#PBS -o com_ortho_para.$PBS_JOBID.txt
#PBS -l nodes=1:ppn=12
#PBS -l walltime=00:00:10:00

cd $PBS_O_WORKDIR

module purge

module load blast

makeblastdb -in ref_prot.fasta -dbtype prot

blastp  -query transcriptome.fasta -db ref_prot.fasta  -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'
ADD REPLY
0
Entering edit mode

warning:

=>> PBS: job killed: walltime 623 exceeded limit 600
PBS Job Statistics:
PBS Job ID: 3248444.sched
ADD REPLY
1
Entering edit mode
6.6 years ago

Here is my blast command. blastp -query transcriptome.fasta -db ref_prot.fasta -evalue 1e-50 -outfmt '6 qseqid sseqid sseq'

use the option -out fileout.tsv to capture the output in a file. and then use @genomax's awk script.

EDIT: or pipe blastp into genomax's awk and redirect the output

blastp (...) | awk (...) > out.txt
ADD COMMENT
1
Entering edit mode
6.6 years ago
kashiff007 ★ 1.9k
grep "^Acan" ur_blast.output | awk '{print">query: "$1"\tHit: "$2"\tscore: "$4":"$5":"$6"\n"$3}' > ur_result.output
ADD COMMENT

Login before adding your answer.

Traffic: 1855 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6