Retrieve all predicted CDS from NCBI
2.3 years ago
tlorin • 250
Switzerland

Dear all,

Apologies if this has been answered elsewhere, but I couldn't find an answer to this problem.

I would like to retrieve all the predicted coding sequences for a species from the NCBI FTP site. Let's say I go here. I know how to get all the predicted mRNAs (./RNA/Gnomon_mRNA.fsa) or all the predicted proteins (./protein/protein.fa), but I cannot find how to get the CDS, if that is possible at all. This can be done on the Ensembl FTP.

Thanks for any insight!

24 days ago
genomax 68k
United States

Perhaps not directly, but the following may be one way.

Get the genes (select "protein-coding" in the left pane if you only want those): http://www.ncbi.nlm.nih.gov/gene/?term=txid144197[Organism:noexp] Choose "Send to file" and then "tabular text view" to download the full table. Cut out the interval columns to get the locations, then use getfasta from bedtools to recover the DNA sequence. Use the directions I put together for that here: C: Why big gaps when I use Entrez Eutils to download protein coding sequences.
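
The "cut the interval columns out" step could look roughly like the sketch below, which turns the downloaded table into BED lines for bedtools getfasta. The column names are assumptions about the header of the tabular export, so check them against your own file. The function name is made up for illustration.

```python
import csv
import io

# Assumed column names in the NCBI Gene tabular export; verify against your header.
ACC = "genomic_nucleotide_accession.version"
START = "start_position_on_the_genomic_accession"
END = "end_position_on_the_genomic_accession"
STRAND = "orientation"
NAME = "Symbol"


def gene_table_to_bed(text):
    """Convert tab-separated gene table text into a list of BED6 lines."""
    bed = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        # Skip genes that have no placed interval.
        if not row.get(ACC) or not row.get(START) or not row.get(END):
            continue
        start, end = sorted((int(row[START]), int(row[END])))
        # Assumption: minus-strand genes are labelled "minus" in the table.
        strand = "-" if row.get(STRAND) == "minus" else "+"
        # BED is 0-based, half-open; the table is assumed 1-based, inclusive.
        fields = [row[ACC], str(start - 1), str(end), row[NAME], "0", strand]
        bed.append("\t".join(fields))
    return bed
```

Write the returned lines to a file (say genes.bed, a placeholder name) and pass it to bedtools getfasta with -fi for the genome FASTA and -bed for the intervals; adding -s makes it respect the strand column. The sequence names in the FASTA must match the accessions in the first BED column.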

Thanks! But then I'd get the introns too, not just the CDS?

Yes, but they would be in lower case (if I recall correctly), so you could remove them that way.

Alternatively, why not get the GFF file from here and then use the same bedtools getfasta method? You would need to work out the longest transcript for each gene, which is probably what you want.
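
For the GFF route, here is a rough sketch in plain Python (a stand-in for the bedtools step, not the same method) that joins the CDS rows of each transcript, so introns are left out, and keeps the longest spliced CDS per gene as a proxy for the longest transcript. It assumes GFF3 with a single Parent attribute on each CDS row and a gene attribute naming the gene; if gene is missing, it falls back to the parent ID. The function names are made up for illustration, and the genome is assumed to be already loaded into a dict.

```python
from collections import defaultdict

# Complement table used for minus-strand features.
_COMP = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_attributes(field):
    """Turn a GFF3 attribute column ('key=value;key=value') into a dict."""
    items = field.strip().split(";")
    return dict(item.split("=", 1) for item in items if "=" in item)


def longest_cds_per_gene(gff_lines, genome):
    """Return {gene: spliced CDS}, keeping the longest CDS for each gene.

    genome maps sequence ID -> sequence string (already read from FASTA).
    """
    segments = defaultdict(list)  # parent transcript -> [(start, end), ...]
    info = {}                     # parent transcript -> (seqid, strand, gene)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        parent = attrs.get("Parent")
        if parent is None:
            continue
        segments[parent].append((int(cols[3]), int(cols[4])))
        # Assumption: a 'gene' attribute names the gene; else use the parent ID.
        info[parent] = (cols[0], cols[6], attrs.get("gene", parent))

    best = {}
    for parent, parts in segments.items():
        seqid, strand, gene = info[parent]
        # GFF coordinates are 1-based and inclusive.
        seq = "".join(genome[seqid][s - 1:e] for s, e in sorted(parts))
        if strand == "-":
            seq = seq.translate(_COMP)[::-1]  # reverse complement
        if len(seq) > len(best.get(gene, "")):
            best[gene] = seq
    return best
```

The sequence IDs in the first GFF column have to match the keys of the genome dict, and CDS rows with several comma-separated parents would need extra handling.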

+1 for providing another method. I'm surprised, though, that it's not a built-in option! It would be more convenient :o)
