Question

Multiple mapping (Taxonomy, Seed, ...) of blast(-like) tabular output


9.7 years ago

BSP • 0

Hello,

I am dealing with short sequence reads (>1,000,000 reads) from a metagenome and a metatranscriptome. I used a BLAST-like program called RAPSearch 2, which provides me with a BLAST Tabular output format like this:

query_id   subject_id       %_identity   alignment_length  mismatch  gap  query_start  query_end  subject_start  subject_end  log_evalue  bit_score
HWI-ST...  438753.AZC_3721  60.7843      51                20        0    1            51         213            263          -9.38       68.17

and an alignment format like this:

HWI-ST... vs 438753.AZC_3721 bits=68.17 log(E-value)=-9.38 identity=60.7843% aln-len=51 mismatch=20 gap-openings=0 nFrame=0
Query:         1 EGRLASLLTDVAAGRLAPLYNYMKDLPAMEGTPAPFLPRRYIERMLGSSSS 51
                 EGRL  +L D A+G L PLYN+M DLP + GTP PFLP+ Y+ R LG SSS
Sbjct:       213 EGRLDQVLHDAASGTLEPLYNFMNDLPGIGGTPVPFLPKTYVSRTLGLSSS 263

What I want to achieve is to compare the information from the output with mapping files to retrieve cross-linked information in a tab-delimited file format. More precisely, I have an output (searched against the eggNOG database) containing information about:

Protein name (and the basic output like evalue, alignment length,...)

I downloaded mapping files, which contain:

nog name: taxID.protein name (this is the actual subject_id from the output files!)
cog name: protein name
nog name: function
nog name: description
species name: tax id
cog name: functional category
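
To make the key structure concrete: since the subject_id has the form taxID.protein name, splitting it at the first dot gives the tax ID needed for the species mapping. This is only a minimal sketch, assuming tab-delimited output with the subject_id in column 2; the function name `split_subject_id` and the query ID `read1` are made up for illustration.

```shell
# Split subject_id (column 2) at the first dot into tax ID and protein name.
# Assumption: tab-delimited BLAST-style input; the sample row is illustrative.
split_subject_id() {
  awk 'BEGIN { FS = OFS = "\t" }
       {
         dot = index($2, ".")             # position of the first dot
         taxid = substr($2, 1, dot - 1)   # text before the dot
         protein = substr($2, dot + 1)    # text after the dot
         print $1, taxid, protein
       }'
}

printf 'read1\t438753.AZC_3721\t60.7843\n' | split_subject_id
```

The resulting tax ID column could then serve as the lookup key against the species name: tax id file.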

I would like to collect this information for my search results. I have already spent a lot of time browsing through available scripts and programs (e.g. BioPython), but I have not yet found a suitable solution. Maybe I overlooked something? Additionally, I would like to use, for example, the TaxID to retrieve a more detailed taxonomic classification (e.g. Kingdom; Phylum; ...).

Does anyone have an idea how to start here? I would appreciate any suggestions on this matter.

Thank you!

Edit: I just noticed that Seed is in my post title. I also searched against the RefSeq protein database and the UniRef90 database and wanted to analyze the Seed content of the output, since mapping information between these resources also exists. However, I still haven't found a way to do this.

RAPSearch2 Mapping BLAST • 3.1k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by BSP • 0

Ram · Answer 1 · 2014-08-09


9.7 years ago

5heikki 11k

Sounds like a job for join.

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by 5heikki 11k


Thank you, 5heikki, for this hint. However, as far as I understand, join only merges two files, meaning that one line in file 1 is connected to one line in file 2 (which is why both files need to be sorted), so the output contains only one line for each connected pair.

ADD REPLY • link 9.7 years ago by BSP • 0


Output can be customized, e.g.

join -1 1 -2 1 -o 2.1,1.2,2.2 <(sort -k1,1 table1) <(sort -k1,1 table2)

This would join the tables on the value of the first column and output field 1 of table2, field 2 of table1, and field 2 of table2.
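
join is also not limited to one-to-one matches: every line in one file is paired with every line in the other that shares the key, so many hits against the same subject_id each get the annotation. A small sketch of this, assuming tab-delimited files; the file names `hits.tsv` and `map.tsv`, the function `annotate`, and all rows (including `NOG_example`) are made up for illustration.

```shell
# Sketch: several hits sharing one subject_id each pick up the same annotation.
# Assumptions: tab-delimited input; hits.tsv has subject_id in column 2,
# map.tsv has subject_id in column 1 and a NOG name in column 2 (made-up rows).
set -eu
LC_ALL=C; export LC_ALL        # sort and join must agree on collation order
tab=$(printf '\t')

tmp=$(mktemp -d)
printf 'read1\t438753.AZC_3721\t68.17\nread2\t438753.AZC_3721\t55.00\n' > "$tmp/hits.tsv"
printf '438753.AZC_3721\tNOG_example\n' > "$tmp/map.tsv"

# Join hits (key = column 2) with the mapping (key = column 1).
# -t sets the delimiter; -o selects query_id, subject_id, bit score, NOG name.
annotate() {
  sort -t "$tab" -k2,2 "$1" > "$tmp/hits.sorted"
  sort -t "$tab" -k1,1 "$2" > "$tmp/map.sorted"
  join -t "$tab" -1 2 -2 1 -o 1.1,1.2,1.3,2.2 "$tmp/hits.sorted" "$tmp/map.sorted"
}

annotate "$tmp/hits.tsv" "$tmp/map.tsv"
```

Both reads appear in the output, each with the NOG name appended as a fourth column. Hits whose subject_id is missing from the mapping file are dropped by default; `-a 1` would keep them.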

ADD REPLY • link 9.7 years ago by 5heikki 11k