Question

How to separate plasmid proteins from main chromosome proteins in a GenBank assembly record?

8.2 years ago

svetlana.lockwood ▴ 20

I'm working on a mega-project involving all proteobacterial proteomes. Recently, NCBI relocated all assembly records here. Previously, records for plasmids were kept separately from those for main chromosomes, but now they are combined in one file, for example GCA_000010825.1_ASM1082v1_protein.faa.gz.

Question: within such a record, how can I tell which proteins came from plasmids and which from chromosomes?

If I were working on one genome, I could trace each protein individually, but with more than 2,000 complete genomes that is not feasible. Also, I cannot rely on sequence annotations such as "plasmid backbone", since not all plasmid proteins are necessarily "plasmid backbone" proteins.

Any ideas?

Assembly genome GenBank • 2.2k views

updated 5.6 years ago by Ram 43k • written 8.2 years ago by svetlana.lockwood ▴ 20

Ram · Answer 1 · 2016-02-18

8.2 years ago

5heikki 11k

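Use the feature table file that ships with each assembly. Its fifth column (seq_type) records whether each feature lies on a chromosome or a plasmid: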

cut -f5 -d $'\t' GCA_000010825.1_ASM1082v1_feature_table.txt | sort | uniq -c
 5400 chromosome
 844 plasmid
 1 seq_type

Parse the protein accessions from column 11 of that table, then extract the matching sequences from the faa file. For the chromosome, for example:

awk -F '\t' '{if($5=="chromosome")print $11}' GCA_000010825.1_ASM1082v1_feature_table.txt |\
    grep . > GCA_000010825.1_ASM1082v1_chromosome_protein_coding.acc
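The extraction step from the faa file is not spelled out above, so here is a minimal sketch of one way to do it in a single awk pass. It assumes the protein FASTA has been decompressed first (e.g. with gunzip -c), that the ID right after ">" in each header matches column 11 of the feature table exactly, and that column 5 holds the sequence type as in the output above. The function name split_faa_by_seq_type is made up for illustration.

```shell
# split_faa_by_seq_type FEATURE_TABLE PROTEIN_FAA SEQ_TYPE
# Hypothetical helper: prints the FASTA records whose accession is listed in
# column 11 of feature-table rows whose column 5 equals SEQ_TYPE.
split_faa_by_seq_type() {
    feature_table=$1
    protein_faa=$2      # assumed to be uncompressed FASTA
    seq_type=$3         # e.g. "chromosome" or "plasmid"
    awk -F '\t' -v type="$seq_type" '
        # First file (feature table): remember accessions of the wanted type,
        # skipping rows with an empty column 11 (e.g. gene rows).
        NR == FNR { if ($5 == type && $11 != "") keep[$11] = 1; next }
        # Second file (FASTA): on each header, take the ID up to the first
        # whitespace and decide whether this record should be printed.
        /^>/ { id = substr($0, 2); sub(/[ \t].*/, "", id); printing = (id in keep) }
        # Print header and sequence lines of records we are keeping.
        printing
    ' "$feature_table" "$protein_faa"
}
```

Called once with "chromosome" and once with "plasmid", with the output redirected to two files, this would split one assembly; wrapping the two calls in a loop over all *_feature_table.txt files should scale to the 2,000+ genomes.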

updated 5.6 years ago by Ram 43k • written 8.2 years ago by 5heikki 11k

Wow! Thank you very much! I would have never known this.

8.2 years ago by svetlana.lockwood ▴ 20

Very useful, thanks.

6.7 years ago by sinumolgeorge ▴ 10