Question

How to get a FASTA format file with the DNA sequences of all annotated genes?

2.7 years ago

Student ▴ 30

I am analysing Pyrococcus furiosus DNA sequencing data, using the data published here in NCBI. When I click "Send to" > "Gene Features" > "FASTA format", I download a file containing the gene sequences of this organism, but I noticed that some sequences appear twice. Is there a way to get a file with all annotated genes and their respective DNA sequences, without duplicates and in a well-ordered form? NCBI (at the link above) lists 2,128 genes, so I would like a FASTA file with all 2,128 annotated genes and their DNA sequences. Do you know of another website, or another place in NCBI, where I can get this kind of file?

fasta sequence-analysis database dataset ncbi • 600 views

updated 2.7 years ago by Mensur Dlakic ★ 27k • written 2.7 years ago by Student ▴ 30

score 1 · Answer 1 · 2021-08-30

Using Entrez Direct (EDirect):

$ esearch -db nuccore -query NC_003413.1 | efetch -format fasta_cds_na | grep ">" | wc -l
2060

$ esearch -db nuccore -query NC_003413.1 | efetch -format fasta_cds_na > gene.fa

You should also be able to get the files from the genome FTP folder for this organism: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/008/245/085/GCF_008245085.1_ASM824508v1/GCF_008245085.1_ASM824508v1_cds_from_genomic.fna.gz
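If a downloaded FASTA file still contains repeated records, something like the following could drop them. This is only a sketch: the function name dedup_fasta_by_id and the output file name gene.dedup.fa are made up for illustration, and it assumes that duplicated records share the same identifier (the first whitespace-delimited token of the header line). If your duplicates have different headers but identical sequences, this will not catch them.

```shell
# Hypothetical helper: keep only the first record for each FASTA identifier.
# Reads FASTA on stdin, writes de-duplicated FASTA to stdout.
# Assumption: duplicates share the same first token on the ">" header line.
dedup_fasta_by_id() {
    awk '
        /^>/ {
            id = $1                 # first token of the header, including ">"
            keep = !(id in seen)    # true only the first time this ID appears
            seen[id] = 1
        }
        keep                        # print header and sequence lines of kept records
    '
}

# Usage, assuming gene.fa was written by the efetch command above:
#   dedup_fasta_by_id < gene.fa > gene.dedup.fa
#   grep -c "^>" gene.dedup.fa
```

Records are kept in their original order, and multi-line sequences are handled because every line after a kept header is printed until the next header.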

score 1 · Answer 2 · 2021-08-30

I see two Pyrococcus furiosus DSM 3638 RefSeq assemblies:

I think the files you want have cds_from_genomic.fna.gz in their names. They need to be decompressed using gunzip.
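To check how many records one of these files holds before decompressing it on disk, a small helper along these lines might be handy. The function name count_fasta_records_gz is made up for illustration, and the file name in the usage comment is an assumption about what you saved the download as.

```shell
# Hypothetical helper: count FASTA records in a gzip-compressed file.
# gunzip -c streams the decompressed text to stdout and leaves the .gz untouched.
count_fasta_records_gz() {
    gunzip -c "$1" | grep -c '^>'   # each record starts with one ">" header line
}

# Usage, assuming the download was saved as cds_from_genomic.fna.gz:
#   count_fasta_records_gz cds_from_genomic.fna.gz
#   gunzip cds_from_genomic.fna.gz    # then decompress once the count looks right
```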