Download complete bacterial genomes and associated plasmid sequences from NCBI
7.6 years ago

Hey all!

I am trying to download all completely assembled bacterial genomes together with their associated plasmid sequences. I download the complete sequences using Biopython:

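from Bio import Entrez
Entrez.email = "you@example.org"  # placeholder: NCBI asks for a contact address
search_term = "bacteria[orgn] AND complete genome[title]"
handle = Entrez.esearch(db="nucleotide", retmax=100000, term=search_term)
genome_id = Entrez.read(handle)["IdList"]  # list of nucleotide IDs as strings
handle.close()
print("Fetched Id list...")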

This gives me a list of the ID numbers of all bacterial genomes. Then I use Entrez efetch to download each one like this (in both GenBank and FASTA format):

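# 'genome' is a single ID taken from the genome_id list
record = Entrez.efetch(db="nucleotide", id=genome, rettype="fasta", retmode="text")
time.sleep(1)  # pause between requests (requires "import time")
seq_record = Entrez.efetch(db="nucleotide", id=genome, rettype="gbwithparts", retmode="text")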
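A minimal sketch of how the efetch step could be run over the whole ID list in batches rather than one ID per request. The helper names `batched` and `fetch_all`, the batch size of 200, and the output path are assumptions for illustration, not part of the original script. `Entrez.email` is assumed to be set beforehand.

```python
import time


def batched(ids, size):
    """Yield consecutive slices of `ids` with at most `size` items each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_all(ids, rettype, out_path, batch_size=200, pause=1.0):
    """Download all records into one text file (needs network access)."""
    # Imported here so that batched() can be used without Biopython installed.
    from Bio import Entrez

    with open(out_path, "w") as out:
        for chunk in batched(ids, batch_size):
            # efetch accepts several IDs as one comma-separated string
            handle = Entrez.efetch(db="nucleotide", id=",".join(chunk),
                                   rettype=rettype, retmode="text")
            out.write(handle.read())
            handle.close()
            time.sleep(pause)  # stay polite towards the NCBI servers
```

Usage would be something like `fetch_all(genome_id, "fasta", "genomes.fasta")`, and the same call with `"gbwithparts"` for the GenBank records.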

However, to my knowledge plasmids are not included in 'complete genome' sequences. I know that the data I need are organized on NCBI. When I enter 'bacteria[ORGN]' as the search criterion in the NCBI search, I get a page listing the different bacteria that have sequence data on NCBI (https://www.ncbi.nlm.nih.gov/genome/?term=bacteria%5BORGN%5D). Clicking on a bacterium and then on the 'Organism overview: Genome assembly and annotation report' link leads me to a table listing every assembly and the corresponding plasmid sequences (https://www.ncbi.nlm.nih.gov/genome/genomes/154). After deselecting contigs, chromosome and scaffolds in the table, it contains exactly the data I want: the genome accession for each complete assembly, plus the corresponding plasmid IDs. I can even download it in .csv format.

The problem: NCBI contains 8900 different sequenced species/strains of bacteria. If I have to download the data manually for each bacterium, I have to prolong my education for at least 10 years. Is there any biopython or NCBI guru out there who knows how to automate this?

ncbi genbank biopython • 4.7k views

This does not answer your question directly, but the plasmid information you need is available in this directory. Get the *.genomic.fna files, which contain the sequences, with the name of the organism each plasmid is associated with in the FASTA header. You will have to split them, but it sounds like you know Python, so that should be simple for you.
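Splitting such a multi-FASTA file into individual records could look roughly like this. It is a plain-Python sketch; `read_fasta` is a hypothetical helper, and the headers in the example are made up for illustration rather than taken from NCBI. With Biopython installed, `Bio.SeqIO.parse(path, "fasta")` does the same job.

```python
def read_fasta(lines):
    """Yield (accession, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    # The accession is everything up to the first space in the header.
    accession, _, description = header.partition(" ")
    return accession, description, "".join(chunks)


# Made-up example input, only to show the shape of the result.
example = """>ACC_1.1 Example organism plasmid pA, complete sequence
ACGT
ACGT
>ACC_2.1 Example organism plasmid pB, complete sequence
TTGA
"""
records = list(read_fasta(example.splitlines()))
```

Each record can then be written to its own file, or grouped by whatever part of the description identifies the organism.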


Hey, thanks for your answer! The problem with this is that several strains of each species exist, but not each of those carries the plasmid. Most of the time, strain information is not in the fasta header. In the table I provided, the plasmids are associated with the respective genomes, which is important for me. I need to know exactly which plasmid is associated with which genome.


You can retrieve the table easily, correct? Could you use that information to match the plasmids with the right genomes?

Edit: Based on this thread, What does 'complete genome' in NCBI include, the information is there in the records for genomic DNA.
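Matching plasmids to genomes from the downloaded table might be sketched as follows. This assumes, without confirmation from the thread, that the CSV has one row per assembly and a replicon column formatted like `chromosome:ACC/ACC; plasmid pX:ACC/ACC`. The column names `Strain` and `Replicons` and the accessions in the example are hypothetical, so adjust them to the actual file.

```python
import csv
import io


def parse_replicons(field):
    """Sort the accessions of one replicon string by replicon type."""
    result = {"chromosome": [], "plasmid": [], "other": []}
    for entry in field.split(";"):
        entry = entry.strip()
        if ":" not in entry:
            continue
        label, _, accessions = entry.partition(":")
        label = label.lower()
        if label.startswith("plasmid"):
            kind = "plasmid"
        elif label.startswith("chromosome"):
            kind = "chromosome"
        else:
            kind = "other"
        # Assumed layout: paired accessions joined by "/", keep the first.
        result[kind].append(accessions.split("/")[0].strip())
    return result


def genome_to_plasmids(csv_text, strain_col="Strain", replicon_col="Replicons"):
    """Map each strain to its chromosome and plasmid accessions."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        table[row[strain_col]] = parse_replicons(row[replicon_col])
    return table


# Made-up rows that only illustrate the assumed format.
example_csv = (
    "Strain,Replicons\n"
    'strainA,"chromosome:ACC_CHR1.1/ACC_CHR1G.1; plasmid pX:ACC_PL1.1/ACC_PL1G.1"\n'
    "strainB,chromosome:ACC_CHR2.1\n"
)
mapping = genome_to_plasmids(example_csv)
```

The resulting accessions per strain can then be passed to efetch, so every plasmid stays tied to the genome it came from.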

6.5 years ago
alceal • 0

Did you manage to automate it? I need to do the same. I downloaded the CSV and loaded it into a pandas data frame, but I'd like to know how to do that with Biopython.


Hey,

Yes, in the end I did it as described in the question. For many of the genomes the associated plasmids are contained in the multifasta file if I remember correctly :-)

6.2 years ago
tdmurphy ▴ 190

You could also do this using the "Download Assemblies" button in NCBI Assembly. Start with a query like this:

https://www.ncbi.nlm.nih.gov/assembly/?term=bacteria%5Borgn%5D+AND+has_plasmid%5BProperties%5D+AND+latest_refseq%5Bfilter%5D+AND+complete_genome%5Bfilter%5D

Click "Download Assemblies", select RefSeq or GenBank (for FASTA, the only difference should be the accessions), and choose genomic FASTA. That will give you a tarball with one file for each assembly (genomic + plasmid(s)).
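For anyone who wants to script the query instead of using the button, here is a hedged sketch. It rebuilds the URL above from the search term and runs the same term against the `assembly` database with Biopython's `Entrez.esearch`. The function names are made up for illustration, `Entrez.email` is assumed to be set, and turning the returned IDs into downloaded files is not shown.

```python
from urllib.parse import urlencode

ASSEMBLY_TERM = (
    "bacteria[orgn] AND has_plasmid[Properties] "
    "AND latest_refseq[filter] AND complete_genome[filter]"
)


def assembly_search_url(term=ASSEMBLY_TERM):
    """Build the NCBI Assembly web search URL for a query term."""
    return "https://www.ncbi.nlm.nih.gov/assembly/?" + urlencode({"term": term})


def search_assembly_ids(term=ASSEMBLY_TERM, retmax=10000):
    """Return Assembly database IDs matching the term (needs network access)."""
    # Imported here so the URL helper works without Biopython installed.
    from Bio import Entrez

    # Result sets larger than retmax would need paging with retstart.
    handle = Entrez.esearch(db="assembly", term=term, retmax=retmax)
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    return ids
```

Changing the filters in `ASSEMBLY_TERM` (for example dropping `has_plasmid[Properties]`) gives the corresponding broader query.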
