Retrieve all predicted cds from NCBI
7.7 years ago
tlorin ▴ 360

Dear all,

Apologies if this has been answered elsewhere, but I couldn't find an answer to this problem.

I would like to retrieve all the predicted coding sequences for a species from the NCBI FTP. Let's say I go here. I know how to get all the predicted mRNAs (./RNA/Gnomon_mRNA.fsa) or all the predicted proteins (./protein/protein.fa), but I cannot find how to get the CDS, if that is even possible. This can be done on the Ensembl FTP.

Thanks for any insight!

ncbi blast cds ensembl • 1.9k views
7.7 years ago
GenoMax 141k

Perhaps not directly, but the following may be one way.

Get the genes (select protein-coding in the left pane if you only want those): http://www.ncbi.nlm.nih.gov/gene/?term=txid144197[Organism:noexp] Choose "Send to file" and then "tabular text view" to download the full table. Cut out the interval columns for the locations, and then use getfasta from bedtools to recover the DNA sequence. Use the directions I put together for that here: C: Why big gaps when I use Entrez Eutils to download protein coding sequences.
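As a rough illustration of the "cut the interval columns out" step, here is a minimal Python sketch that turns the downloaded table into BED lines for bedtools getfasta. The column names used as defaults are my assumption about what the Gene tabular download contains, and I am also assuming the table coordinates are 1-based; check both against your file and adjust the arguments if they differ. The function name gene_table_to_bed is made up for this example.

```python
import csv
import io


def gene_table_to_bed(table_text,
                      chrom_col="genomic_nucleotide_accession.version",
                      start_col="start_position_on_the_genomic_accession",
                      end_col="end_position_on_the_genomic_accession",
                      strand_col="orientation",
                      name_col="Symbol",
                      one_based=True):
    """Convert a tab-separated gene table (as text) into BED6 lines.

    All column names are assumptions about the download format.
    """
    strand_map = {"plus": "+", "minus": "-"}
    bed_lines = []
    for row in csv.DictReader(io.StringIO(table_text), delimiter="\t"):
        chrom, start, end = row[chrom_col], row[start_col], row[end_col]
        if not (chrom and start and end):
            continue  # skip genes that have no genomic placement
        start, end = sorted((int(start), int(end)))
        if one_based:
            start -= 1  # BED intervals are 0-based, half-open
        strand = strand_map.get(row.get(strand_col) or "", ".")
        bed_lines.append(f"{chrom}\t{start}\t{end}\t{row[name_col]}\t0\t{strand}")
    return bed_lines
```

Write the returned lines to a file such as genes.bed and run something like `bedtools getfasta -fi genome.fa -bed genes.bed -s -name -fo genes.fa`. The sequence names in genome.fa have to match the accession column exactly.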


Thanks! But then I'd get the introns too, not the cds only?


Yes, but they would be in lower case (if I recollect correctly). You could remove them that way.

Alternatively, why not get the GFF file from here and then use the same bedtools getfasta method? You would need to figure out the longest transcript for each gene, which is probably what you want.
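For the GFF route, a minimal pure-Python sketch of the idea (not bedtools itself) could look like the following. It assumes a GFF3 file in which each CDS row carries a Parent attribute pointing at an mRNA row, and each mRNA row has ID and Parent (gene) attributes; the function names are made up for this example. It joins the CDS segments of each transcript, reverse-complements minus-strand ones, and keeps the longest CDS per gene. It ignores the phase column, so partial CDS may not start in frame.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(lines):
    """Parse FASTA lines into {sequence_name: sequence}."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name and line:
            seqs[name].append(line)
    return {k: "".join(v) for k, v in seqs.items()}


def gff_rows(gff_lines):
    """Yield (columns, attributes) for each feature row of a GFF3 file."""
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9:
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        yield cols, attrs


def cds_per_transcript(gff_lines, genome):
    """Return {transcript_id: spliced CDS} built from the CDS rows."""
    segments = defaultdict(list)
    for cols, attrs in gff_rows(gff_lines):
        if cols[2] == "CDS" and "Parent" in attrs:
            segments[attrs["Parent"]].append(
                (cols[0], int(cols[3]), int(cols[4]), cols[6]))
    cds = {}
    for tx, segs in segments.items():
        segs.sort(key=lambda s: s[1])  # order segments by genomic start
        # GFF coordinates are 1-based and inclusive
        seq = "".join(genome[c][s - 1:e] for c, s, e, _ in segs)
        if segs[0][3] == "-":
            seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
        cds[tx] = seq
    return cds


def longest_cds_per_gene(cds, gff_lines):
    """Return {gene_id: (transcript_id, cds)} keeping the longest CDS."""
    gene_of = {a["ID"]: a["Parent"] for cols, a in gff_rows(gff_lines)
               if cols[2] == "mRNA" and "ID" in a and "Parent" in a}
    best = {}
    for tx, seq in cds.items():
        gene = gene_of.get(tx, tx)  # fall back to the transcript id
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (tx, seq)
    return best
```

Pass the GFF as a list of lines (for example from `open(path).readlines()`), because it is read twice: once for the CDS rows and once for the mRNA-to-gene mapping.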


+1 for providing another method. I'm surprised though that it's not a built-in option! Would be more convenient :o)
