Question: getting feature annotations for all NCBI Refseq sequences
18 months ago
bitpir • 130

Hi, I was wondering if there's a way to download all feature annotations (gene, CDS, rRNA, tRNA, ncRNA, repeat_region) for all the RefSeq sequences (ftp://ftp.ncbi.nih.gov/refseq/release/) from NCBI. I can't seem to find them anywhere on the web or on the NCBI site. Something like GFF for GenBank sequences would be great. Thanks!

18 months ago
genomax 68k
United States

Get the assembly summary file for NCBI RefSeq genomes.

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

Grab the FTP path (and/or any other fields you need) from this file for each genome. The FTP path is column 20; lines starting with # are header lines and should be skipped.

awk -F '\t' '!/^#/ {print $20}' assembly_summary_refseq.txt > ftp_paths

You should get something like this:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/599/545/GCF_000599545.1_ASM59954v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/599/565/GCF_000599565.1_TruePyoMS2391.0
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/599/605/GCF_000599605.1_HJM029

From each of these FTP directories you should be able to get the *_genomic.gff.gz file for that genome.
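As a rough sketch of that last step, the GFF URL for each genome can be built from the lines in ftp_paths. This assumes the annotation file is named after its directory with a `_genomic.gff.gz` suffix, which matches the usual layout but is worth checking for your genomes. The function name `gff_urls` and the output file `gff_urls.txt` are made up for illustration.

```shell
# Read assembly directory URLs on stdin and print one GFF URL per line.
# Assumption: each directory holds a file named <directory name>_genomic.gff.gz.
gff_urls() {
    while IFS= read -r dir; do
        dir=${dir%/}    # drop a trailing slash, if any
        case $dir in
            # Only lines that look like URLs are used; header text or
            # placeholder values such as "na" are skipped.
            ftp://*|https://*)
                printf '%s/%s_genomic.gff.gz\n' "$dir" "${dir##*/}"
                ;;
        esac
    done
}

# Possible usage (commented out because it downloads one file per assembly):
# gff_urls < ftp_paths > gff_urls.txt
# wget -i gff_urls.txt
```

Writing the URLs to a file first makes it easy to inspect or subset the list before starting a large download.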


Awesome! Thank you so much for your answer, this is super helpful! Are the sequences from RefSeq/release represented in assembly_summary_refseq.txt? And in the case of viruses, e.g. https://www.ncbi.nlm.nih.gov/nuccore/LC340031.1, is there a similar assembly_summary file where I can get all the annotations? Is there also a file somewhere that maps NC_ accessions to assemblies? I can see the assembly when I search for the NC_ number, but I can never find the file list easily... Thanks a lot for your help!

Reply by bitpir • 130, 18 months ago

There is a similar summary file for viruses. You can find that here.

What exactly do you mean by an "NC_ to assembly" type of file? Can you give an example?

Reply by genomax • 68k, 18 months ago

Great, thank you! My virus db includes the records listed under "View all RefSeq and Neighbor nucleotide records"; will all of those viruses be captured in the summary file? I was also thinking of mapping each NC_ accession number to its assembly accession number (e.g. NC_011750.1 --> GCF_000026345.1). I have downloaded the RefSeq genomic sequences, and I'm working backwards to see which NC_ accession is associated with which assembly, and then getting their respective annotations.

Reply by bitpir • 130, 18 months ago
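One possible way to build such an NC_ to assembly table, sketched here under assumptions rather than taken from the thread: once the *_genomic.gff.gz files are downloaded, each file's `##sequence-region` lines name the sequences it annotates, and the assembly accession can be read from the start of the file name. The function name `seq_to_assembly` and the output file name are hypothetical, and the sketch assumes file names of the form GCF_<number>.<version>_<assembly name>_genomic.gff.gz.

```shell
# Print "sequence_accession<TAB>assembly_accession" for one downloaded GFF file.
# Assumptions: the file name starts with the assembly accession (the first two
# underscore-separated fields), and the GFF contains ##sequence-region lines.
seq_to_assembly() {
    file=$1
    base=${file##*/}                                   # strip the directory part
    assembly=$(printf '%s\n' "$base" | cut -d_ -f1,2)  # e.g. GCF_<number>.<version>
    gzip -dc "$file" |
        awk -v asm="$assembly" '$1 == "##sequence-region" { print $2 "\t" asm }'
}

# Possible usage over all downloaded files (commented out):
# for f in *_genomic.gff.gz; do seq_to_assembly "$f"; done > seq_to_assembly.tsv
```

The resulting table can then be searched for an accession such as NC_011750.1 to find the assembly whose GFF carries its annotation.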

Ah! Just found the answer to my question, for the second part at least :)

Reply by bitpir • 130, 18 months ago
