Question

How to retrieve the genome assembly accession ID for a nucleotide accession ID using an NCBI E-utilities command

3 months ago

Mani • 0

I have a file containing a list of nucleotide sequence accession numbers, and I need to get the corresponding assembly accession numbers using the Entrez command-line tools (E-utilities). For example, when I enter "CP020622.1" in the search box on the main NCBI web page and choose "Assembly" from the menu, it gives me the assembly accession ID "GCA_002234695.1" (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_002234695.1/). I want to run an E-utilities command that outputs the genome assembly accession for any input accession ID I provide (for example, given a chromosome accession ID, retrieve the corresponding genome assembly accession ID). Does anybody know which command I can use for this?

eDirect NCBI eUtility Entrez Assembly • 1.6k views

ADD COMMENT • link 3 months ago by Mani • 0

score 0 · Answer 1 · 2024-01-03

This is a possibility using R and the rentrez package.

# Dependencies
library(rentrez)

# Define a search using the "search term"
asearch <- entrez_search(db = "assembly", term = "CP020622.1[ALL]")

# Get the summary data for that assembly id 
asumm <- entrez_summary(db = "assembly", id = asearch$ids)

# Extract the assembly accession
asumm$assemblyaccession

That code yields:

[1] "GCA_002234695.1"

score 0 · Answer 2 · 2024-01-03

3 months ago

GenoMax 141k

Using EntrezDirect (command line utility). You have seen the R version above.

$ esearch -db nuccore -query CP020622 | elink -target assembly | esummary | xtract -pattern DocumentSummary -element AssemblyAccession
GCA_002234695.1

You could also search the assembly database directly:

$ esearch -db assembly -query CP020622 | esummary | xtract -pattern DocumentSummary -element AssemblyAccession
GCA_002234695.1

ADD COMMENT • link 3 months ago by GenoMax 141k
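
Since the original question starts from a file of accessions, the first pipeline above can be wrapped in a loop. This is only a minimal sketch, assuming a file with one nucleotide accession per line. The function name `nuc_to_assembly` and the file name `nucleotide_ids.txt` are made up for illustration. The `< /dev/null` is there because EDirect commands read standard input and would otherwise consume the rest of the list inside a `while read` loop.

```shell
# Hypothetical helper: print "nucleotide_accession<TAB>assembly_accession"
# for every accession listed (one per line) in the file given as $1.
nuc_to_assembly() {
  while IFS= read -r acc; do
    # Skip empty lines.
    [ -n "$acc" ] || continue
    # Same pipeline as in the answer above; < /dev/null keeps esearch
    # from reading the remaining lines of the accession list.
    asm=$(esearch -db nuccore -query "$acc" < /dev/null |
      elink -target assembly |
      esummary |
      xtract -pattern DocumentSummary -element AssemblyAccession)
    printf '%s\t%s\n' "$acc" "$asm"
  done < "$1"
}

# Usage (assumed file names):
# nuc_to_assembly nucleotide_ids.txt > nuc_to_assembly.tsv
```

If a nucleotide record links to more than one assembly, the second column may span several lines, so it is worth checking the output before using it downstream.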

Thanks a lot for your helpful comments, and sorry for my delayed response. I successfully retrieved assembly accession IDs for a number of species and saved them in a text file (one accession per line). Now I want to download the genomes in compressed FASTA format (genome.fa.gz) using the eDirect command-line utility. Could you please help me with that as well?

Cheers, Mani

ADD REPLY • link 3 months ago by Mani • 0

EntrezDirect is not meant for downloading genome sequences on the command line. You should use the NCBI datasets tool for that purpose. Relevant details are here: Download many NCBI genomes with list of GCA identifiers

ADD REPLY • link 3 months ago by GenoMax 141k
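
For the compressed FASTA asked about above, here is a minimal sketch with the datasets tool. It assumes `datasets`, `unzip` and `gzip` are installed, and that the genome FASTA sits under `ncbi_dataset/data/<accession>/` with a `.fna` extension inside the downloaded package; that layout is an assumption and can be checked with `unzip -l`. The function name `fetch_genome_gz` and the output naming are made up for illustration.

```shell
# Hypothetical helper: download one assembly and write <accession>.fa.gz
# into the current directory.
fetch_genome_gz() {
  acc=$1
  # --include genome limits the package to the genomic FASTA;
  # --filename gives this accession its own zip instead of ncbi_dataset.zip.
  datasets download genome accession "$acc" --include genome --filename "${acc}.zip" || return 1
  # Stream the .fna file(s) out of the package and recompress with gzip.
  # The path pattern reflects the assumed package layout.
  unzip -p "${acc}.zip" 'ncbi_dataset/data/*/*.fna' | gzip > "${acc}.fa.gz"
}

# Usage:
# fetch_genome_gz GCA_002234695.1
```

Streaming through `unzip -p` avoids unpacking the whole package to disk when only the sequence file is needed.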

Thanks for that.

ADD REPLY • link 3 months ago by Mani • 0

I created a file (database_genome_list.txt) containing the assembly accessions for two genomes. For example:

cat database_genome_list.txt

GCA_002234695.1
GCA_002234715.1

When I run the datasets command to download them in parallel:

parallel --bar --jobs 2 -a database_genome_list.txt-2 'datasets download genome accession {}'

I get this error:

0% 0:2=0s GCA_002234715.1
Collecting 1 genome record [================================================] 100% 1/1
Downloading: ncbi_dataset.zip 218MB valid zip structure -- files not checked
Validating package [================================================] 100% 4/4
50% 1:1=6s GCA_002234715.1
Collecting 1 genome record [================================================] 100% 1/1
Downloading: ncbi_dataset.zip 237MB 5.02MB/s
Error: Internal error (invalid zip archive). Please try again

Use datasets download genome accession <command> --help for detailed help about a command.
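
One possible explanation, not confirmed anywhere in this thread: both parallel jobs write to the same default `ncbi_dataset.zip` in the current directory, so one job may overwrite or validate an archive the other is still writing. A sketch of a workaround is to give each download its own output name with `--filename`. The function name `download_all` is made up for illustration, and the list file name is the one from the post.

```shell
# Hypothetical helper: download every accession listed in file $1,
# one zip per accession, so no two downloads share ncbi_dataset.zip.
download_all() {
  # Word splitting handles accessions separated by spaces or newlines.
  for acc in $(cat "$1"); do
    datasets download genome accession "$acc" --filename "${acc}.zip" ||
      echo "download failed for $acc" >&2
  done
}

# Usage:
# download_all database_genome_list.txt
#
# Parallel variant of the command from the post, with a per-job file name:
# parallel --bar --jobs 2 -a database_genome_list.txt 'datasets download genome accession {} --filename {}.zip'
```

The datasets tool can also take the whole list at once through its `--inputfile` option, which produces a single package and sidesteps the naming issue altogether.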