How to separate plasmid proteins from main chromosome proteins in a GenBank assembly record?
8.2 years ago

I'm working on a mega-project involving all proteobacterial proteomes. Recently NCBI relocated all assembly records here. Previously, records for plasmids were kept separately from those for main chromosomes, but now they are placed in one file, for example GCA_000010825.1_ASM1082v1_protein.faa.gz.

Question: within that record, how can I tell which proteins come from plasmids and which come from chromosomes?

If I were working on a single genome, I could trace each protein individually, but with more than 2,000 complete genomes that is not feasible. Also, I cannot rely on sequence annotations such as "plasmid backbone", since not all plasmid proteins are necessarily "plasmid backbone" proteins.

Any ideas?

Assembly genome GenBank • 2.2k views
8.2 years ago
5heikki 11k

Use the feature table file. Its fifth column (seq_type) says whether each feature lies on a chromosome or a plasmid:

cut -f5 -d $'\t' GCA_000010825.1_ASM1082v1_feature_table.txt | sort | uniq -c
 5400 chromosome
 844 plasmid
 1 seq_type

Parse the protein accessions (column 11) from that file, then extract the matching sequences from the faa file. For the chromosome, for example:

awk -F '\t' '{if($5=="chromosome")print $11}' GCA_000010825.1_ASM1082v1_feature_table.txt |\
    grep . > GCA_000010825.1_ASM1082v1_chromosome_protein_coding.acc
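
The second step, pulling the matching sequences out of the faa file, could be sketched as below. This is only a minimal sketch: the function name extract_by_acc is made up for illustration, and it assumes the faa file has been decompressed and that each FASTA header starts with the protein accession right after ">".

```shell
# Hypothetical helper (not part of any NCBI tool): print only the FASTA
# records whose ID, i.e. the first word after ">", is in an accession list.
#   $1 = accession list, one ID per line
#   $2 = protein FASTA file (uncompressed)
extract_by_acc() {
    awk '
        # First file: remember every accession.
        NR == FNR { keep[$1]; next }
        # FASTA header: strip ">" from the first word and look it up.
        /^>/ { id = substr($1, 2); p = (id in keep) }
        # Print header and sequence lines while inside a wanted record.
        p
    ' "$1" "$2"
}

# Example usage (file names follow the ones used above):
#   gunzip -c GCA_000010825.1_ASM1082v1_protein.faa.gz > GCA_000010825.1_ASM1082v1_protein.faa
#   extract_by_acc GCA_000010825.1_ASM1082v1_chromosome_protein_coding.acc \
#       GCA_000010825.1_ASM1082v1_protein.faa > GCA_000010825.1_ASM1082v1_chromosome_protein.faa
```

Swapping "chromosome" for "plasmid" in the awk command above should give the plasmid accession list, which can be passed to the same function.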

Wow! Thank you very much! I would have never known this.


Very useful, thanks.
