I am trying to get a list of RefSeqs from an Assembly in Entrez, but it's not working.
First, I am working with GRCh37.1. I can do a manual query on Entrez like so:
GCA_000001405.14[Assembly Accession]
This returns a web page that clearly has the RefSeqs in it. I can get a text copy of the full report by clicking the hyperlink "Download full sequence report", which takes me to ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000001405.25.assembly.txt
But when I try to do this query via BioPython, I only get the summary, and I can't seem to get the "full report" that shows the RefSeqs:
from Bio import Entrez
Entrez.email = "brianfeeny@fas.harvard.edu" # Always tell NCBI who you are
# Do a search
handle = Entrez.esearch(db="assembly", term="Homo sapiens[Orgn] AND GCA_000001405.2[ASAC] AND Chromosome[ASLV]")
record = Entrez.read(handle)
for id in record['IdList']:
    # Get the document summary for each assembly UID
    esummary_handle = Entrez.esummary(db="assembly", id=id, report="full")
    esummary_record = Entrez.read(esummary_handle)
    print(esummary_record)
The above correctly shows ID 6328, which is the same UID as the web page query. It prints this record (a Biopython DictElement rather than raw JSON):
{
u'DocumentSummarySet':DictElement({
u'DbBuild':'Build150509-1000.1',
u'DocumentSummary':[
DictElement({
u'ChainId':'1405',
u'AsmUpdateDate':'2013/08/08 00:00',
u'GbUid':'6328',
u'PropertyList':[
'altloci',
'full-genome-representation',
'has-chromosome',
'replaced',
'replaced_genbank'
],
u'SubmissionDate':'2010/02/22 00:00',
u'GB_BioProjects':[
{
u'BioprojectAccn':'PRJNA31257',
u'BioprojectId':'31257'
}
],
u'AssemblyAccession':'GCA_000001405.2',
u'LastMajorReleaseAccession':'GCA_000001405.1',
u'Synonym':{
u'RefSeq':'',
u'Genbank':'GCA_000001405.2',
u'Similarity':''
},
u'AssemblyType':'haploid-with-alt-loci',
u'AssemblyDescription':'Genome Reference Consortium Human Build 37 patch release 1 (GRCh37.p1)',
u'RS_Projects':[
],
u'RefSeq_category':'na',
u'PartialGenomeRepresentation':'false',
u'Coverage':'0',
u'AssemblyClass':'haploid-with-alt-loci',
u'SeqReleaseDate':'2010/02/22 00:00',
u'AnomalousList':[
],
u'AssemblyStatus':'Chromosome',
u'AsmReleaseDate':'2010/02/22 00:00',
u'ReleaseLevel':'Patch',
u'Taxid':'9606',
u'LastUpdateDate':'2013/08/08 00:00',
u'RsUid':'',
u'SortOrder':'5C30000014059799',
u'WGS':'',
u'GB_Projects':[
'31257'
],
u'BioSampleId':'',
u'AssemblyName':'GRCh37.p1',
u'EnsemblName':'',
u'Organism':'Homo sapiens',
u'Biosource':{
u'Isolate':'',
u'InfraspeciesList':[
],
u'Sex':''
},
u'SpeciesName':'Homo sapiens',
u'BioSampleAccn':'',
u'SpeciesTaxid':'9606',
u'UCSCName':'',
u'Primary':'358',
u'Meta':' <Stats> <Stat category="alt_loci_count" sequence_tag="all">9</Stat> <Stat category="chromosome_count" sequence_tag="all">24</Stat> <Stat category="contig_count" sequence_tag="all">462</Stat> <Stat category="contig_l50" sequence_tag="all">24</Stat> <Stat category="contig_n50" sequence_tag="all">38508932</Stat> <Stat category="non_chromosome_replicon_count" sequence_tag="all">0</Stat> <Stat category="replicon_count" sequence_tag="all">24</Stat> <Stat category="scaffold_count" sequence_tag="all">260</Stat> <Stat category="scaffold_count" sequence_tag="placed">190</Stat> <Stat category="scaffold_count" sequence_tag="unlocalized">20</Stat> <Stat category="scaffold_count" sequence_tag="unplaced">39</Stat> <Stat category="scaffold_l50" sequence_tag="all">21</Stat> <Stat category="scaffold_n50" sequence_tag="all">46395641</Stat> <Stat category="total_length" sequence_tag="all">3139095181</Stat> <Stat category="ungapped_length" sequence_tag="all">2899144443</Stat> </Stats> <FtpSites> <FtpPath type="GenBank">ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p1/</FtpPath> </FtpSites> <assembly-level>3</assembly-level> <assembly-status>Chromosome</assembly-status> <representative-status>na</representative-status> <submitter-organization>Genome Reference Consortium</submitter-organization> ',
u'NCBIReleaseDate':'2010/02/22 00:00',
u'SubmitterOrganization':'Genome Reference Consortium',
u'RS_BioProjects':[
]
},
attributes={
u'uid':u'6328'
})
]
},
attributes={
u'status':u'OK'
})
}
So I then try to fetch it:
for id in record['IdList']:
    efetch_handle = Entrez.efetch(db="assembly", id=id, retmode="text")
    print(efetch_handle.read())
But this just returns "6328". No actual file data, RefSeqs or otherwise.
Can someone identify what I am doing wrong?
Should be like that:
Alternatively, instead of doing that, you can get the FTP path and download the file using wget:
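The FTP route can be sketched roughly as follows. This is only an illustration, not the original snippet from this answer: the helper names `ftp_paths_from_meta` and `refseq_accessions` are made up here, and the column layout of the assembly report (a `#`-prefixed, tab-separated header line containing `RefSeq-Accn`) is an assumption about the file format, so check it against the file you actually download. The sample report rows are invented for the example and were not fetched from NCBI.

```python
import xml.etree.ElementTree as ET


def ftp_paths_from_meta(meta):
    """Return {type: url} for each <FtpPath> in an esummary 'Meta' string.

    'Meta' is an XML fragment with several top-level elements, so it is
    wrapped in a dummy root element before parsing.
    """
    root = ET.fromstring("<Meta>" + meta + "</Meta>")
    return {el.get("type"): el.text.strip() for el in root.iter("FtpPath")}


def refseq_accessions(report_text):
    """Collect the RefSeq-Accn column from assembly report text.

    Assumes the column names sit on a '#' header line and that rows
    without a RefSeq record carry 'na' in that column.
    """
    column = None
    accessions = []
    for line in report_text.splitlines():
        if line.startswith("#"):
            fields = line.lstrip("# ").split("\t")
            if "RefSeq-Accn" in fields:
                column = fields.index("RefSeq-Accn")
            continue
        fields = line.split("\t")
        if column is not None and len(fields) > column and fields[column] != "na":
            accessions.append(fields[column])
    return accessions


# Fragment of the 'Meta' value printed in the question.
meta = (' <FtpSites> <FtpPath type="GenBank">ftp://ftp.ncbi.nlm.nih.gov/genbank/'
        'genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p1/</FtpPath>'
        ' </FtpSites> ')
paths = ftp_paths_from_meta(meta)
print(paths["GenBank"])

# Invented sample rows in the assumed report layout (not real NCBI data).
sample_report = (
    "# Assembly name:  EXAMPLE\n"
    "# Sequence-Name\tSequence-Role\tGenBank-Accn\tRelationship\tRefSeq-Accn\n"
    "1\tassembled-molecule\tXX000001.1\t=\tNC_EXAMPLE1.1\n"
    "scaf1\tunplaced-scaffold\tXX000002.1\t<>\tna\n"
)
refseqs = refseq_accessions(sample_report)
print(refseqs)
```

In practice `meta` would come from `esummary_record['DocumentSummarySet']['DocumentSummary'][0]['Meta']`, the returned URL can be handed to wget, and the downloaded report text can be passed to `refseq_accessions`. For the record in the question only a GenBank path is listed, which fits with its `Synonym` entry having an empty `RefSeq` value.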