Best method to retrieve genome sequences
9.8 years ago

What is the best method to get the sequences of all eukaryotic genomes (just the DNA sequences of the chromosomes)?

I wrote a Perl script to retrieve the sequences by accession, but I don't have all the accessions, and Ensembl will probably block my IP if I try to download them all programmatically.

genome sequence • 2.5k views
I don't think Ensembl / Ensembl Genomes will IP-block you if you retrieve genome sequences from their respective FTP sites. IP blocks are generally given only to people who scrape the website and thereby bring down the production servers.

I read on their website that repeated requests can be considered abusive :S so I'm not sure about that.

I am sure that downloading many files from the FTP site is not considered abusive. Cheers, Bert (Ensembl team member April 2005 - March 2014 :) )
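
For anyone who wants to script the FTP download, here is a minimal sketch that fetches files one at a time with a short pause between requests. The `download_all` helper, the delay value, and the placeholder URL in the usage comment are assumptions for illustration only; take the real file paths from the FTP directory listing.

```python
import os
import time
import urllib.request
from urllib.parse import urlparse


def download_all(urls, dest_dir, delay_seconds=1.0,
                 fetch=urllib.request.urlretrieve):
    """Download each URL into dest_dir sequentially.

    Files that already exist locally are skipped, so an interrupted
    run can be restarted without fetching everything again.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        filename = os.path.basename(urlparse(url).path)
        target = os.path.join(dest_dir, filename)
        if not os.path.exists(target):
            fetch(url, target)
            # Pause between requests so the server is not hammered.
            if index < len(urls) - 1:
                time.sleep(delay_seconds)
        saved.append(target)
    return saved


# Usage (hypothetical placeholder URL, replace with real FTP paths):
# download_all(["ftp://example.org/pub/some_genome.fa.gz"], "genomes")
```

Fetching sequentially keeps the load on the server low, which is the point made above about not scraping the production website.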

9.8 years ago
MAPK ★ 2.1k

You can look up the taxids and extract all the GI/accession numbers for those species from NCBI. Once you have the GIs, you can download the sequences from NCBI's nr/nt database using the blastdbcmd -entry_batch option in standalone BLAST (I don't remember the command exactly).
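
A rough sketch of the batch retrieval step, assuming BLAST+ is installed and a local copy of the `nt` database (built with parsed sequence IDs) is available. The helper names and file names are hypothetical; `-db`, `-entry_batch`, and `-out` are the blastdbcmd options used.

```python
import subprocess


def write_id_file(ids, path):
    """Write one identifier per line, the format -entry_batch expects."""
    with open(path, "w") as handle:
        for identifier in ids:
            handle.write(f"{identifier}\n")
    return path


def build_blastdbcmd_args(db, id_file, out_fasta):
    """Return the argument list for a batch retrieval with blastdbcmd."""
    return [
        "blastdbcmd",
        "-db", db,                 # name of the local BLAST database
        "-entry_batch", id_file,   # file with one identifier per line
        "-out", out_fasta,         # FASTA output file
    ]


def run_blastdbcmd(args):
    """Run the command; raises CalledProcessError if blastdbcmd fails."""
    subprocess.run(args, check=True)


# Usage (assumes blastdbcmd is on PATH and the nt database is local):
# id_file = write_id_file(my_accessions, "ids.txt")
# run_blastdbcmd(build_blastdbcmd_args("nt", id_file, "sequences.fasta"))
```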

I'll give it a try, thank you

But I wouldn't know which GIs belong to chromosomes, because the list holds all the GIs for that taxid with no indication of sequence type.

I think that is correct; you will have to figure out a way to filter out the mitochondrial sequences.
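
One possible way to do that filtering, sketched under the assumption that organelle records can be recognised by words in the FASTA header (such as "mitochondrion"). That convention is not guaranteed for every record, so the keyword list and both helper functions are illustrative only.

```python
ORGANELLE_KEYWORDS = ("mitochondri", "chloroplast", "plastid")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_organelles(records, keywords=ORGANELLE_KEYWORDS):
    """Keep only records whose header mentions none of the keywords."""
    for header, sequence in records:
        lowered = header.lower()
        if not any(word in lowered for word in keywords):
            yield header, sequence


# Usage (hypothetical file name):
# with open("sequences.fasta") as handle:
#     nuclear = list(drop_organelles(read_fasta(handle)))
```

Header matching is a heuristic; checking the kept records against an assembly report or chromosome list is safer where one is available.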

I found a file on the NCBI genome FTP page that lists all the IDs for each kingdom, which made it a lot easier. Thank you for the hint.
