How to retrieve the corresponding genome assembly accession ID from a nucleotide accession ID using NCBI eUtility command
3 months ago
Mani • 0

I have a file containing a list of nucleotide sequence accession numbers, and I need to get the corresponding assembly accession numbers using the Entrez command line (eUtility). For example, when I enter "CP020622.1" in the search box on the main NCBI web page and choose "Assembly" from the menu, it returns the assembly accession ID "GCA_002234695.1" (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_002234695.1/). I want to run an eUtility command that outputs the genome assembly accession for any input accession ID I provide (for example, given a chromosome accession ID, retrieve the corresponding genome assembly accession ID). Does anybody know a command I can use for this purpose?

eDirect NCBI eUtility Entrez Assembly • 1.6k views
3 months ago
josev.die ▴ 60

This is a possibility using R and the rentrez package.

# Dependencies
library(rentrez)

# Define a search using the "search term"
asearch <- entrez_search(db = "assembly", term = "CP020622.1[ALL]")

# Get the summary data for that assembly id 
asumm <- entrez_summary(db = "assembly", id = asearch$ids)

# Extract the assembly accession
asumm$assemblyaccession

That code yields:

[1] "GCA_002234695.1"
3 months ago
GenoMax 141k

Using EntrezDirect (command line utility). You have seen the R version above.

$ esearch -db nuccore -query CP020622 | elink -target assembly | esummary | xtract -pattern DocumentSummary -element AssemblyAccession
GCA_002234695.1

You could also search the assembly database directly:

$ esearch -db assembly -query CP020622 | esummary | xtract -pattern DocumentSummary -element AssemblyAccession
GCA_002234695.1
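
Since the question starts from a file of accessions, the first command above can be wrapped in a loop. The following is only a sketch, not a tested pipeline: the function name nuc_to_assembly is made up for illustration, and it assumes EDirect (esearch, elink, esummary, xtract) is installed and on PATH and that the input file holds one nucleotide accession per line. The </dev/null redirect is there because esearch would otherwise read the rest of the list from standard input inside the loop.

```shell
# Hypothetical helper: print "nucleotide<TAB>assembly" for each accession in a file.
nuc_to_assembly() {
    local list_file=$1 acc asm
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline.
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines.
        [ -z "$acc" ] && continue
        # </dev/null stops esearch from consuming the remaining lines of the list.
        # paste joins multiple linked assemblies (if any) with commas.
        asm=$(esearch -db nuccore -query "$acc" </dev/null |
              elink -target assembly |
              esummary |
              xtract -pattern DocumentSummary -element AssemblyAccession |
              paste -sd, -)
        # Report NA when no assembly is linked to the accession.
        printf '%s\t%s\n' "$acc" "${asm:-NA}"
    done < "$list_file"
}
```

It could then be called as nuc_to_assembly nucleotide_accessions.txt > assembly_accessions.tsv, where both file names are placeholders.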

Thanks a lot for your helpful comments, and sorry for my delayed response. I successfully got the assembly accession IDs for a number of species and saved them in a text file (one accession per line). Now I want to download the genomes in compressed FASTA format (genome.fa.gz) using the eDirect command line utility. Could you please help me with that as well?

Cheers, Mani


EntrezDirect is not meant for downloading genome sequences on the command line. You should use the NCBI datasets tool for that purpose. Relevant details are in this thread: "Download many NCBI genomes with list of GCA identifiers".


Thanks for that.


I created a file (database_genome_list.txt) containing the assembly accessions for two genomes. For example:

cat database_genome_list.txt

GCA_002234695.1
GCA_002234715.1

When I run the datasets command to download them in parallel:

parallel --bar --jobs 2 -a database_genome_list.txt-2 'datasets download genome accession {}'

I get this error:

0% 0:2=0s GCA_002234715.1
Collecting 1 genome record [================================================] 100% 1/1
Downloading: ncbi_dataset.zip 218MB valid zip structure -- files not checked
Validating package [================================================] 100% 4/4
50% 1:1=6s GCA_002234715.1
Collecting 1 genome record [================================================] 100% 1/1
Downloading: ncbi_dataset.zip 237MB 5.02MB/s
Error: Internal error (invalid zip archive). Please try again

Use datasets download genome accession <command> --help for detailed help about a command.

Do you know how I can download more than one assembly with one command?

Cheers, Mani


Sorry, I figured it out and resolved the issue. Thanks for your help.
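
The thread does not say how the issue was resolved. One plausible cause, offered here as an assumption, is that both parallel jobs wrote to the same default output file (ncbi_dataset.zip, as shown in the log) and corrupted it. The sketch below sidesteps that by giving every accession its own archive via the --filename option of datasets download genome accession. It runs sequentially rather than in parallel, the function name download_assemblies is made up for illustration, and it assumes the NCBI datasets CLI is installed.

```shell
# Hypothetical helper: download each assembly in a list to its own zip archive.
download_assemblies() {
    local list_file=$1 out_dir=${2:-.} acc
    mkdir -p "$out_dir"
    # Word splitting accepts both one-per-line and space-separated lists.
    for acc in $(cat "$list_file"); do
        # --include genome requests the genomic FASTA;
        # --filename avoids the shared default name ncbi_dataset.zip.
        datasets download genome accession "$acc" \
            --include genome \
            --filename "$out_dir/$acc.zip"
    done
}
```

For example, download_assemblies database_genome_list.txt genomes would write genomes/GCA_002234695.1.zip and genomes/GCA_002234715.1.zip. The same idea should carry over to the parallel command by adding --filename {}.zip inside the quoted command. Passing the whole list in one call with --inputfile is another option. The FASTA files inside each archive can then be extracted with unzip and compressed with gzip.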
