Getting feature annotations for all NCBI RefSeq sequences
2.2 years ago
bitpir • 130

Hi, I was wondering if there's a way to download all feature annotations (gene, CDS, rRNA, tRNA, ncRNA, repeat_region) for all the RefSeq sequences (ftp://ftp.ncbi.nih.gov/refseq/release/) from NCBI? I can't seem to find it anywhere on the web or on the NCBI site. Something like GFF for GenBank sequences would be great. Thanks!

4 weeks ago
genomax 68k
United States

Get the assembly summary file for NCBI RefSeq genomes.

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt


Grab the ftp path (and/or any other fields you need) from this file for each genome. The ftp path is column 20; lines starting with # are header lines and are skipped.

awk -F '\t' '!/^#/ {print $20}' assembly_summary_refseq.txt > ftp_paths


You should get something like this:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/599/545/GCF_000599545.1_ASM59954v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/599/565/GCF_000599565.1_TruePyoMS2391.0
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/599/605/GCF_000599605.1_HJM029


From each of these ftp directories you should be able to get the *genomic.gff.gz file for that genome.
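
As a rough sketch of that last step: the functions below turn each ftp path into the URL of its GFF file. The names gff_url and make_gff_urls are made up for illustration, and the sketch assumes the file is named after the last component of the directory path followed by _genomic.gff.gz, and that empty or "na" entries in ftp_paths should be skipped.

```shell
# Build the GFF URL for one assembly directory path.
# Assumption: the file is named <directory name>_genomic.gff.gz.
gff_url() {
    gu_path="${1%/}"            # strip a trailing slash, if any
    gu_name="${gu_path##*/}"    # last path component, e.g. GCF_000599545.1_ASM59954v1
    printf '%s/%s_genomic.gff.gz\n' "$gu_path" "$gu_name"
}

# Read a file of ftp paths (one per line) and print one GFF URL per path.
# Assumption: empty lines and "na" are placeholders without a directory.
make_gff_urls() {
    while IFS= read -r mg_line; do
        case "$mg_line" in
            ''|na) continue ;;
        esac
        gff_url "$mg_line"
    done < "$1"
}

# Example use (not run here because it needs network access):
#   make_gff_urls ftp_paths > gff_urls
#   wget -i gff_urls
```

Writing the URLs to a file first makes it easy to check a few by hand before starting a large download.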


Awesome! Thank you so much for your answer, this is super helpful! Are the sequences from the RefSeq release directory represented in assembly_summary_refseq.txt? And for viruses, e.g. https://www.ncbi.nlm.nih.gov/nuccore/LC340031.1, is there a similar assembly_summary file where I can get all the annotations? Is there also a NC_ to assembly type of file somewhere? I can see the assembly when I search for the NC number, but I can never find the file list easily... Thanks a lot for your help!


There is a similar summary file for viruses. You can find that here.

What exactly do you mean by NC_ to assembly type? Can you give an example?


Great, thank you! My virus db was built from "View all RefSeq and Neighbor nucleotide records"; will all of those viruses be captured in the summary file? By NC_ to assembly I mean matching every NC_ accession number to its assembly accession number (e.g. NC_011750.1 --> GCF_000026345.1). I have downloaded the RefSeq genomic sequences, and I'm working backwards to see which NC_ is associated with which assembly number so I can get their respective annotations.
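
To make the mapping concrete, here is a minimal sketch that lists the sequence accessions found in one downloaded *_genomic.gff.gz file next to the assembly accession taken from its file name. The function name seqids_to_assembly is made up, and the sketch assumes that column 1 of the GFF holds the sequence accession (e.g. NC_011750.1) and that the file name starts with the assembly accession (e.g. GCF_000026345.1_...).

```shell
# Print "sequence_accession<TAB>assembly_accession" for one *_genomic.gff.gz file.
# Assumptions: GFF column 1 is the sequence accession, and the file name
# begins with the assembly accession (its first two "_"-separated fields).
seqids_to_assembly() {
    sa_gff="$1"
    sa_base="${sa_gff##*/}"                              # file name without directory
    sa_asm=$(printf '%s\n' "$sa_base" | cut -d_ -f1,2)   # e.g. GCF_000026345.1
    # Skip comment lines, keep only 9-column feature lines,
    # and print each sequence accession once.
    gzip -dc "$sa_gff" |
        awk -F '\t' -v asm="$sa_asm" '!/^#/ && NF >= 9 && !seen[$1]++ {print $1 "\t" asm}'
}

# Example use over all downloaded files (hypothetical output file name):
#   for f in *_genomic.gff.gz; do seqids_to_assembly "$f"; done > nc_to_assembly.tsv
```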