Biostar Beta. Not for public use.
Using ape package in R to pull genbank sequences
0
Entering edit mode
11 months ago
krc3004 • 0

Hi all- I am trying use the ape package in R to pull protein sequences from GenBank. The sequence I'd like to retrieve is the following: https://www.ncbi.nlm.nih.gov/protein/AAG26087.1

To do this I used the following code:

library(ape)
kl_seq = read.GenBank("AAG26087.1", as.character = TRUE)

And the sequence I obtained was:

$AAG26087.1
  [1] "m" "s" "r" "s" "s" "k" "r" "n" "r" "d" "g" "r" "g" "w" "v" "n" "g" "r" "k" "k" "d" "r" "d" "r" "k" "k" "k" "k" "k" "k" "d" "n" "w" "g" "n" "k" "g" "g" "k" "k" "d"
 [42] "k" "d" "g" "g" "a" "a" "k" "r" "a" "r" "t" "d" "m" "d" "s" "g" "g" "k" "r" "r" "g" "g" "s" "d" "k" "r" "d" "h" "r" "r" "r" "k" "a" "n" "k" "r" "k" "a" "a" "g" "g"
 [83] "k" "h" "s" "k" "k" "r" "t" "d" "r" "r" "r" "r" "t" "a" "g" "s" "v" "g" "g" "v" "n" "g" "g" "s" "r" "g" "a" "g" "g" "g" "v" "n" "m" "s" "v" "s" "s" "r" "t" "g" "g"
[124] "d" "v" "r" "g" "n" "g" "w" "d" "a" "d" "s" "s" "c" "r"

attr(,"species")
[1] "Hepatitis_delta_virus"

However, this sequence clearly does not match the sequence of length 214 aa in the above link. I'm wondering if there's something I'm missing here- perhaps there is another argument, function, or package I should use to pull the correct sequence? Is there any way to bulk download from GenBank directly, given a list of accession IDs? Any tips would be much appreciated. Thanks!

ADD COMMENTlink
1
Entering edit mode

Use NCBI's unix utilities.

efetch -db nuccore -id "AAG26087.1" -format fasta > AAG26087.fa

Replace ID's in a loop if you have multiple ID's to download

ADD REPLYlink
0
Entering edit mode

genomax, thanks very much for your help- I will try this. Is there any way to use efetch to get the nucleotide sequence as well?

ADD REPLYlink
0
Entering edit mode

Just use the right accession number and it will return nucleotide sequence.

ADD REPLYlink
1
Entering edit mode
9 months ago
Republic of Ireland

That's weird. I looked at the function souce code and it executes only 2 possible functions:

URL <- paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=", 
            paste(access.nb[a:b], collapse = ","), "&rettype=fasta&retmode=text")

URL <- paste("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=", 
                paste(access.nb[a:b], collapse = ","), "&rettype=gb&retmode=text")

I'm not sure about the paste(access.nb[a:b], collapse = ",") part; however, if you just go to these URLs in your web browser, you then see the expected sequence. For example:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AAG26087.1&rettype=fasta&retmode=text

This produces:

>AAG26087.1 large delta antigen [Hepatitis delta virus]
MSRSESKRNRDGREGILEQWVNGRKKLEDLERDLRKIKKKIKKLEDENPWLGNIKGILGKKDKDGEGAPP
AKRARTDQMEIDSGPGKRPLRGGFSDKERQDHRRRKALENKRKQLAAGGKHLSKEEEEELKRLTEEDERR
ERRTAGPSVGGVNPLEGGSRGAPGGGFVPNMLSVPESPFSRTGEGLDVRGNQGFPWDILFPADPPFSPQS
CRPQ

The other command:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AAG26087.1&rettype=gb&retmode=text

This just return the webpage (in plain text) at https://www.ncbi.nlm.nih.gov/protein/AAG26087.1

-----------------------------------------

There are ways of automating this in Python but, if you want to stay within R, you can use the httr package. Here is a very messy example, but I'll leave the tidying up to you:

require(httr)
response <- GET("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AAG26087.1&rettype=fasta&retmode=text")
content(r, "text")

No encoding supplied: defaulting to UTF-8.
[1] ">AAG26087.1 large delta antigen [Hepatitis delta virus]\nMSRSESKRNRDGREGILEQWVNGRKKLEDLERDLRKIKKKIKKLEDENPWLGNIKGILGKKDKDGEGAPP\nAKRARTDQMEIDSGPGKRPLRGGFSDKERQDHRRRKALENKRKQLAAGGKHLSKEEEEELKRLTEEDERR\nERRTAGPSVGGVNPLEGGSRGAPGGGFVPNMLSVPESPFSRTGEGLDVRGNQGFPWDILFPADPPFSPQS\nCRPQ\n\n"

You can either mess around with that messy format including the end-line characters, or you can 'cheat' your way and get this into a data-frame as follows:

write(content(r, "text"), "test.fasta")
read.csv("test.fasta")
                X.AAG26087.1.large.delta.antigen..Hepatitis.delta.virus.
1 MSRSESKRNRDGREGILEQWVNGRKKLEDLERDLRKIKKKIKKLEDENPWLGNIKGILGKKDKDGEGAPP
2 AKRARTDQMEIDSGPGKRPLRGGFSDKERQDHRRRKALENKRKQLAAGGKHLSKEEEEELKRLTEEDERR
3 ERRTAGPSVGGVNPLEGGSRGAPGGGFVPNMLSVPESPFSRTGEGLDVRGNQGFPWDILFPADPPFSPQS
4                                                                   CRPQ

What you want to do will depend on your end gal. You could easily convert this into a function that accepts a GenBank accession ID and then loop it.

Kevin

ADD COMMENTlink
0
Entering edit mode

Thank you Kevin! This solution works on my end but it is indeed messy, as you suggest...not sure why the ape package is returning the incorrect sequence. A further question is how to obtain the nucleotide sequence for these accessions but perhaps I can work that out separately...thanks again!

ADD REPLYlink
0
Entering edit mode

Yes, as to what is happening in the ape function should probably be directed to the developers. It's an otherwise extremely good package. Best of luck.

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1