Multiple mapping (Taxonomy, Seed, ...) of blast(-like) tabular output
9.7 years ago
BSP • 0

Hello,

I am dealing with short sequence reads (>1,000,000 reads) from a metagenome and a metatranscriptome. I used a BLAST-like program called RAPSearch 2, which provides me with a BLAST Tabular output format like this:

query_id   subject_id       %_identity   alignment_length  mismatch  gap  query_start  query_end  subject_start  subject_end  log_evalue  bit_score
HWI-ST...  438753.AZC_3721  60.7843      51                20        0    1            51         213            263          -9.38       68.17

and an alignment format like this:

HWI-ST... vs 438753.AZC_3721 bits=68.17 log(E-value)=-9.38 identity=60.7843% aln-len=51 mismatch=20 gap-openings=0 nFrame=0
Query:         1 EGRLASLLTDVAAGRLAPLYNYMKDLPAMEGTPAPFLPRRYIERMLGSSSS 51
                 EGRL  +L D A+G L PLYN+M DLP + GTP PFLP+ Y+ R LG SSS
Sbjct:       213 EGRLDQVLHDAASGTLEPLYNFMNDLPGIGGTPVPFLPKTYVSRTLGLSSS 263

What I want to achieve is to compare the information from the output with mapping files and retrieve the cross-referenced information in a tab-delimited file. More precisely, I have an output (searched against the eggNOG database) containing information about:

Protein name (and the basic output like e-value, alignment length, ...)

I downloaded mapping files, which contain:

  1. nog name: taxID.protein name (this is the actual subject_id from the output files!)
  2. cog name: protein name
  3. nog name: function
  4. nog name: description
  5. species name: tax id
  6. cog name: functional category
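
To make the key concrete: the subject_id in the tabular output appears to be the tax ID and the protein name joined by a dot, so it could be split at the first dot. A minimal Python sketch, assuming that layout holds for every hit (the function name split_subject_id is made up for illustration):

```python
def split_subject_id(subject_id):
    """Split 'taxID.protein_name' at the first dot (assumed layout)."""
    tax_id, _, protein = subject_id.partition(".")
    return tax_id, protein


# Subject ID taken from the example hit above.
tax_id, protein = split_subject_id("438753.AZC_3721")
print(tax_id, protein)
```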

I would like to collect this information for my search results. I have already spent a lot of time browsing through available scripts and programs (e.g. BioPython), but I have not yet found a suitable solution. Maybe I overlooked something? Additionally, I would like to use, for example, the TaxID to retrieve a more detailed taxonomic classification (e.g. Kingdom; Phylum; ...).

Does anyone have an idea on how to start here? I would appreciate any suggestions on this matter.

Thank you!

Edit: I just noticed that Seed is in my post title. I also searched against the RefSeq protein database and the UniRef90 database and wanted to analyze the Seed content of the output, since mapping information between these databases also exists. However, I still have not found a way to do this.

RAPSearch2 Mapping BLAST • 3.1k views
9.7 years ago
5heikki 11k

Sounds like a job for join.


Thank you, 5heikki, for this hint. However, as far as I understand, join only brings files together line by line: one line in file 1 is connected to one line in file 2 (so both files need to be sorted), and the output contains only one line per connected pair.


Output can be customized, e.g.

join -1 1 -2 1 -o 2.1,1.2,2.2 <(sort -k1,1 table1) <(sort -k1,1 table2)

This joins the two tables on the value in their first column and outputs field 1 of table2, field 2 of table1, and field 2 of table2.
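
Note that join is not limited to one line per key: every hit line whose key matches a mapping line is written out, so many reads hitting the same protein each get their annotation. For anyone who prefers a script, the same lookup can be sketched in Python with a dictionary. This is only a sketch under assumptions: both files are tab-delimited, each key occurs once in the mapping file, and the function names, column positions, and sample values (NOG_example, read_1, ...) are made up for illustration.

```python
import csv
import io


def load_mapping(handle, key_column=0, value_column=1):
    """Read a tab-delimited mapping file into a dict (key -> value).

    Column positions are 0-based and are assumptions; adjust them to
    the actual layout of the mapping file.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) > max(key_column, value_column):
            mapping[row[key_column]] = row[value_column]
    return mapping


def annotate_hits(hits_handle, mapping, key_column=1, missing="NA"):
    """Yield each hit row with the mapped value appended.

    key_column=1 is the subject_id column of the tabular output.
    Unlike join, the inputs do not need to be sorted.
    """
    for row in csv.reader(hits_handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment/header lines
        yield row + [mapping.get(row[key_column], missing)]


# Tiny in-memory stand-ins for the real files (values are invented).
mapping_file = io.StringIO("438753.AZC_3721\tNOG_example\n")
hits_file = io.StringIO(
    "read_1\t438753.AZC_3721\t60.7843\n"
    "read_2\t438753.AZC_3721\t55.0\n"
    "read_3\t000000.unknown\t40.0\n"
)

mapping = load_mapping(mapping_file)
for row in annotate_hits(hits_file, mapping):
    print("\t".join(row))
```

With real data, the StringIO objects would be replaced by open file handles. A plain dict keeps only the last value seen for a key, so if one protein can belong to several groups, the values would have to be collected in a list instead.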
