How to recreate alignment information from TSV output
9.8 years ago
Vivek ▴ 50

Hi,

I have a BLAST output file in TSV format, which looks like this:

# BLASTN 2.2.29+
# Query: SI2.2.0_06267 Si_gnF.scaffold02592[1282609..1284114].pep_2
# Database: ./nucleotide/Sinvicta2-2-3.cdna.subset.fasta
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 1 hits found
SI2.2.0_06267    SI2.2.0_06267    100.00    240    0    0    1    240    1    240    1e-126      444

This output has no visual alignment (normally shown by placing the query and hit sequences one above the other, with |'s in between to mark matches). Is it possible to regenerate that view from the information in the TSV output? If so, what would be an efficient way to do it?

Please note that I want to use TSV output in the first place.

Thanks,

blast sequence alignment csv tsv • 4.2k views

Related post: Convert Blast Output Into Blast-Xml

Recreating an alignment from your text file is just like trying to make a cow from a steak.


If you have thousands of lines in your output and want alignments for all of them, consider a simple bash script that reads the output line by line and fetches the query (column 1) and subject (column 2) sequences for new rounds of BLAST. Alternatively, redo your BLAST search with ASN output (outfmt 11), which can be converted to all the other formats with the included blast_formatter tool.
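The line-by-line approach could be sketched as follows, in Python rather than bash. This is only a rough sketch: it assumes the default 12-column tabular layout shown in the question, and the function name hit_pairs is made up for illustration. It collects the distinct query/subject ID pairs; fetching the sequences and re-running BLAST on each pair is left out.

```python
def hit_pairs(lines):
    """Yield distinct (query id, subject id) pairs from BLAST tabular output."""
    seen = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#' comment lines written by -outfmt 7.
        if not line or line.startswith("#"):
            continue
        # Assumption: no column contains whitespace (true for the default
        # 'std' columns), so splitting on any whitespace is safe.
        fields = line.split()
        if len(fields) < 2:
            continue
        pair = (fields[0], fields[1])
        if pair not in seen:
            seen.add(pair)
            yield pair


example = """\
# BLASTN 2.2.29+
# 1 hits found
SI2.2.0_06267\tSI2.2.0_06267\t100.00\t240\t0\t0\t1\t240\t1\t240\t1e-126\t444
"""
pairs = list(hit_pairs(example.splitlines()))
print(pairs)
```

Each pair could then be used to pull the two sequences from their FASTA files and align them again.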

9.8 years ago
Michael 54k

It is not possible to recreate the alignment without recomputing it, because the default tabular columns record only the number of mismatches and gap openings, not their positions. If you need multiple output formats, it is better to save the results in BLAST archive format (ASN.1, -outfmt 11) and convert them to the desired format with blast_formatter.
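As a sketch of that workflow: run BLAST once with -outfmt 11, then call blast_formatter on the saved archive for each format you need. The helper below only builds the command lines, using blast_formatter's -archive, -outfmt and -out options. The function name and the file names are placeholders, and actually running the commands assumes BLAST+ is installed.

```python
import shlex


def formatter_command(archive, outfmt, out_path):
    """Build a blast_formatter command that re-renders a saved -outfmt 11 archive."""
    return [
        "blast_formatter",
        "-archive", archive,
        "-outfmt", str(outfmt),
        "-out", out_path,
    ]


# Hypothetical file names: one archive, two renderings of the same search.
pairwise_cmd = formatter_command("results.asn", 0, "results.pairwise.txt")
tabular_cmd = formatter_command("results.asn", 7, "results.tsv")

print(shlex.join(pairwise_cmd))
print(shlex.join(tabular_cmd))
# To execute one of them: subprocess.run(pairwise_cmd, check=True)
```

This way the search itself is computed once, and the pairwise view and the TSV both come from the same results.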

9.8 years ago
Yannick Wurm ★ 2.5k

BLAST TSV output can include additional columns if the appropriate format specifiers are given when it is generated. Running blastp -help lists the available specifiers, including:

             qseq means Aligned part of query sequence
             sseq means Aligned part of subject sequence

Full output is:

 -outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1) 

   Options 6, 7, and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
           qseqid means Query Seq-id
              qgi means Query GI
             qacc means Query accesion
          qaccver means Query accesion.version
             qlen means Query sequence length
           sseqid means Subject Seq-id
        sallseqid means All subject Seq-id(s), separated by a ';'
              sgi means Subject GI
           sallgi means All subject GIs
             sacc means Subject accession
          saccver means Subject accession.version
          sallacc means All subject accessions
             slen means Subject sequence length
           qstart means Start of alignment in query
             qend means End of alignment in query
           sstart means Start of alignment in subject
             send means End of alignment in subject
             qseq means Aligned part of query sequence
             sseq means Aligned part of subject sequence
           evalue means Expect value
         bitscore means Bit score
            score means Raw score
           length means Alignment length
           pident means Percentage of identical matches
           nident means Number of identical matches
         mismatch means Number of mismatches
         positive means Number of positive-scoring matches
          gapopen means Number of gap openings
             gaps means Total number of gaps
             ppos means Percentage of positive-scoring matches
           frames means Query and subject frames separated by a '/'
           qframe means Query frame
           sframe means Subject frame
             btop means Blast traceback operations (BTOP)
          staxids means unique Subject Taxonomy ID(s), separated by a ';'
                (in numerical order)
        sscinames means unique Subject Scientific Name(s), separated by a ';'
        scomnames means unique Subject Common Name(s), separated by a ';'
       sblastnames means unique Subject Blast Name(s), separated by a ';'
                (in alphabetical order)
       sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
                (in alphabetical order) 
           stitle means Subject Title
       salltitles means All Subject Title(s), separated by a '<>'
          sstrand means Subject Strand
            qcovs means Query Coverage Per Subject
          qcovhsp means Query Coverage Per HSP
   When not provided, the default value is:
   'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
   evalue bitscore', which is equivalent to the keyword 'std'
   Default = `0'
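
With qseq and sseq added to the tabular output (for example -outfmt "6 std qseq sseq"), the pairwise view can be rebuilt without re-running BLAST. A minimal sketch, assuming the two aligned strings have equal length and use '-' for gaps. The helper name render_alignment is hypothetical, and the midline marks only identical residues (it does not show '+' for positive-scoring protein matches, and it omits coordinates).

```python
def render_alignment(qseq, sseq, width=60):
    """Return a pairwise text view of two gapped, equal-length aligned strings."""
    if len(qseq) != len(sseq):
        raise ValueError("aligned sequences must have the same length")
    # '|' where residues are identical and not a gap, otherwise a space.
    midline = "".join(
        "|" if q == s and q != "-" else " " for q, s in zip(qseq, sseq)
    )
    blocks = []
    # Wrap the alignment into blocks of `width` columns.
    for i in range(0, len(qseq), width):
        blocks.append(
            "\n".join([
                "Query  " + qseq[i:i + width],
                "       " + midline[i:i + width],
                "Sbjct  " + sseq[i:i + width],
            ])
        )
    return "\n\n".join(blocks)


print(render_alignment("ACGT-A", "ACCTTA"))
```

In a real file, qseq and sseq would be the last two columns of each row when the format string above is used.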