Question

Replace Accesion Headers by full ncbi Lineage Taxnomy

0

Entering edit mode

5.7 years ago

dllopezr ▴ 120

Hi everybody

I have several multifasta files with headers like this:

>NC_014760.1:c365195-363993 Mycoplasma bovis PG45 clone MU clone A2, complete genome
>NC_013511.1:c476217-475021 Mycoplasma hominis ATCC 23114 chromosome complete genome

I would like to replace this headers by ncbi full taxonomy:

>Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae,Bacillus,Bacillus anthracis

Can you help me with that?

This question FASTA file with taxid, want to obtain lineage and put into header is almost the same that I have, but in there the user had ncbi tax ID instead of accesion number-

Thank you

gene sequence • 1.9k views

ADD COMMENT • link updated 5.7 years ago by shenwei356 8.4k • written 5.7 years ago by dllopezr ▴ 120

0

Entering edit mode

Hello dllopezr,

could you please guide me where the full taxonomy is from and how it is connected to your current headers?

fin swimmer

ADD REPLY • link 5.7 years ago by finswimmer 16k

0

Entering edit mode

The NCBI taxonomy always works in combination with the tax ID. I use the files merged.dmp and rankedlineage.dmp from here ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.zip and ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz.

ADD REPLY • link 5.7 years ago by gb ★ 2.2k

score 2 · Answer 1 · 2018-08-02

Dataset and tools:

Genank accession -> taxid : nucl_gb.accession2taxid.gz
Taxid -> Lineage: taxonkit lineage
Replacing fasta header with corresponding replacements: seqkit replace

Here we go:

Few lines of nucl_gb.accession2taxid.gz

$ zcat nucl_gb.accession2taxid.gz | head -n 3 | column -t
accession  accession.version  taxid  gi
A00002     A00002.1           9913   2
A00003     A00003.1           9913   3

Retrieving the accession from fasta file

$ cat seqs.fa 
>NC_014760.1:c365195-363993 Mycoplasma bovis PG45 clone MU clone A2, complete genome
actg
>NC_013511.1:c476217-475021 Mycoplasma hominis ATCC 23114 chromosome complete genome
ACTN

$ seqkit seq -n -i --id-regexp '^(.+):' seqs.fa | tee seqs.fa.id
NC_014760.1
NC_013511.1

Searching the taxids

$ time csvtk grep -t -f accession.version -P seqs.fa.id nucl_gb.accession2taxid.gz \
    | csvtk cut -t -f accession.version,taxid | tee seqs.fa.id2taxid

NC_013511.1     347256
NC_014760.1     289397

real    0m50.546s
user    1m44.892s
sys     0m4.160s

# faster using shell grep,
$ time zcat nucl_gb.accession2taxid.gz \
    | grep -w  -f seqs.fa.id | cut -f 2,3 | tee seqs.fa.id2taxid
NC_013511.1     347256
NC_014760.1     289397

real    0m19.034s
user    0m23.868s
sys     0m1.384s

Computing lineage

$ time cat seqs.fa.id2taxid \
    | taxonkit lineage -i 2 \
    | sed 1d | cut -f 1,3 | tee seqs.fa.id2taxid2lineage

NC_013511.1     cellular organisms;Bacteria;Terrabacteria group;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma hominis;Mycoplasma hominis ATCC 23114
NC_014760.1     cellular organisms;Bacteria;Terrabacteria group;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma bovis;Mycoplasma bovis PG45

real    0m3.200s
user    0m5.392s
sys     0m0.224s

Replacing fasta header

$ seqkit replace -k seqs.fa.id2taxid2lineage -p '^(.+):.+' -r '{kv}' seqs.fa \
    | seqkit replace -p ';' -r ','

>cellular organisms,Bacteria,Terrabacteria group,Tenericutes,Mollicutes,Mycoplasmatales,Mycoplasmataceae,Mycoplasma,Mycoplasma bovis,Mycoplasma bovis PG45
actg
>cellular organisms,Bacteria,Terrabacteria group,Tenericutes,Mollicutes,Mycoplasmatales,Mycoplasmataceae,Mycoplasma,Mycoplasma hominis,Mycoplasma hominis ATCC 23114
ACTN

You want the lineage in format of "kingdom, phylum ... species" instead of full lineage? Fine:

$ time cat seqs.fa.id2taxid \
    | taxonkit lineage -i 2 \
    | taxonkit reformat -i 3 
    | sed 1d | cut -f 1,4 | tee seqs.fa.id2taxid2lineage2

NC_013511.1     Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma hominis
NC_014760.1     Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma bovis

real    0m5.183s
user    0m13.920s
sys     0m0.764s

$ seqkit replace -k seqs.fa.id2taxid2lineage2 -p '^(.+):.+' -r '{kv}' seqs.fa \
    | seqkit replace -p ';' -r ','

>Bacteria,Tenericutes,Mollicutes,Mycoplasmatales,Mycoplasmataceae,Mycoplasma,Mycoplasma bovis
actg
>Bacteria,Tenericutes,Mollicutes,Mycoplasmatales,Mycoplasmataceae,Mycoplasma,Mycoplasma hominis
ACTN

It's exact what you want.