Species representation of the NCBI RefSeq for simulated reads
6.9 years ago
dabid • 0

I want to generate simulated reads from NCBI RefSeq using ART. Because the RefSeq database is very large and contains many similar genomes (even though it is non-redundant), I want to pick one representative genome for every species in RefSeq (viral, bacterial, archaeal, etc.). I would then use these species representatives to generate the simulated reads instead of the whole database.

Any hints on how to find/get the species representatives in NCBI RefSeq?

Thanks.

dna ncbi genome simulated data • 1.5k views

And how would you select that one sequence (and have it represent a species)? What exactly are you trying to do by making this dataset?


I want to build a comprehensive set of simulated reads to benchmark a few metagenomic tools. But NCBI RefSeq is huge (for bacteria alone there are more than 50,000 genomes), so I cannot use all of it. That is why I thought of taking only one genome per species in RefSeq, which reduces the number of genomes I need to simulate reads from.


Ah, you are planning to use a genome to generate representative reads (not one read per species as I mistakenly thought).

There are assembly summary files on NCBI's genome FTP site (e.g. this one is for RefSeq bacteria). You can get that file and pull out one representative genome (and its accession number). From there you can use the idea here to get the sequence.
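
As a rough sketch of the "pull out one representative genome" step, something like the following could work. It assumes the assembly summary file is tab-separated with a `#`-prefixed header line naming columns such as `assembly_accession`, `refseq_category`, `species_taxid`, `assembly_level` and `ftp_path`. The function name `pick_representatives` and the preference order (reference genome, then representative genome, then most complete assembly level) are my own choices for illustration, not an NCBI rule.

```python
from typing import Dict, Iterable, Tuple

# Lower rank = preferred. These orderings are an assumption made for this
# sketch; adjust them to whatever "representative" should mean for you.
CATEGORY_RANK = {"reference genome": 0, "representative genome": 1}
LEVEL_RANK = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}


def pick_representatives(lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Return one assembly row per species_taxid from assembly summary lines."""
    header = None
    best: Dict[str, Dict[str, str]] = {}
    best_key: Dict[str, Tuple[int, int]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # The column names live on a comment line; other comment
            # lines (e.g. a pointer to the README) are skipped.
            fields = line.lstrip("#").strip().split("\t")
            if "assembly_accession" in fields:
                header = fields
            continue
        if header is None:
            raise ValueError("no header line found before data rows")
        row = dict(zip(header, line.split("\t")))
        # Unknown categories/levels sort after all known ones.
        key = (
            CATEGORY_RANK.get(row.get("refseq_category", ""), 2),
            LEVEL_RANK.get(row.get("assembly_level", ""), 4),
        )
        species = row["species_taxid"]
        # Keep the first row seen for a species unless a better one shows up.
        if species not in best_key or key < best_key[species]:
            best[species] = row
            best_key[species] = key
    return best
```

You would call it on an open file handle, e.g. `pick_representatives(open("assembly_summary.txt"))`, and then use each selected row's `ftp_path` to fetch the genome sequence. Ties are broken by file order, so add further criteria (such as release date) if that matters for your benchmark.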


Yeah, I got the idea. (Actually, I found another link that does almost exactly what I want.) Thank you so much!
