Infer publication date from PubMed ID
7.3 years ago

I have a table of several thousand PubMed IDs, and I wonder if there is a smart way to infer the publication date for each of them.

My first thought was to look for an existing table that lists each PubMed ID alongside its publication date. However, since PubMed IDs are assigned sequentially, I wonder if it would be enough to get the min/max PMID for every year and infer the publication date by finding the interval that contains each ID.

Has anyone ever faced a similar problem? Which database would you use for this?
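The interval idea described above can be sketched in Python as a binary search over per-year boundaries. This is only an illustration: the boundary values in `YEAR_STARTS` are made up, the function name `infer_year` is hypothetical, and the approach is only as good as the assumption that PMIDs increase strictly with publication year.

```python
from bisect import bisect_right

# Hypothetical boundaries: (smallest PMID seen for a year, year). Made-up numbers.
YEAR_STARTS = [
    (1000, 2001),
    (2000, 2002),
    (3500, 2003),
]


def infer_year(pmid, year_starts=YEAR_STARTS):
    """Return the year whose interval contains pmid, or None if pmid is below the first boundary."""
    starts = [start for start, _year in year_starts]
    # Index of the last boundary that is <= pmid.
    idx = bisect_right(starts, pmid) - 1
    if idx < 0:
        return None
    return year_starts[idx][1]


print(infer_year(2500))
```

The list must be sorted by starting PMID for `bisect_right` to work, and any PMID above the last boundary is attributed to the last year.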

pmid year pubmed • 3.1k views

since the pubmed ids are associated sequentially, I wonder if it would be enough to just get the min/max pmid for every year, and infer the publication date by looking for the correct interval.

that's not always the case...


I see! That's too bad, it means I really need to get a table then.

7.2 years ago

Thanks everybody.

Just for reference, I decided to download the whole of MEDLINE and parse the files locally, as I wanted to avoid making hundreds of thousands of queries.

First, I downloaded the files from ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline

Since these are XML files, I extracted the publication date using this XSLT template:




<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="text" encoding="UTF-8"/>

    <xsl:template match="PubmedArticle">
        <xsl:value-of select="MedlineCitation/PMID"/>,<xsl:value-of select="MedlineCitation/Article/Journal/JournalIssue/PubDate/Year"/>,<xsl:value-of select="MedlineCitation/DateCompleted/Year"/>
    </xsl:template>

</xsl:stylesheet>

Then I transformed all the XML files using GNU parallel and xsltproc. This produced a number of text files, each containing three columns (PMID, publication year, and the year the record was completed), which were merged and formatted with an R script to get a final two-column file with PMID and year.
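For readers who prefer a single script over XSLT plus R, the same three fields can be pulled out with Python's standard library. This is a rough sketch rather than the pipeline used above: the function name `iter_pmid_years` is hypothetical, the sample records are made up, and the element paths are simply the ones from the XSLT template.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample in the shape of a PubMed baseline file; the second record has no Year element.
SAMPLE = b"""<PubmedArticleSet>
<PubmedArticle><MedlineCitation><PMID>1</PMID><DateCompleted><Year>1976</Year></DateCompleted><Article><Journal><JournalIssue><PubDate><Year>1975</Year></PubDate></JournalIssue></Journal></Article></MedlineCitation></PubmedArticle>
<PubmedArticle><MedlineCitation><PMID>2</PMID><Article><Journal><JournalIssue><PubDate><MedlineDate>1975 Jun-Jul</MedlineDate></PubDate></JournalIssue></Journal></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def iter_pmid_years(source):
    """Yield (pmid, publication_year, completed_year) per PubmedArticle; missing fields are None."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        pmid = elem.findtext("MedlineCitation/PMID")
        pub_year = elem.findtext("MedlineCitation/Article/Journal/JournalIssue/PubDate/Year")
        completed_year = elem.findtext("MedlineCitation/DateCompleted/Year")
        yield pmid, pub_year, completed_year
        elem.clear()  # drop the finished record's children to keep memory use low


rows = list(iter_pmid_years(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

For a compressed baseline file, passing `gzip.open(path, "rb")` as `source` should work the same way. Records without a `PubDate/Year` element come back as `None` and need separate handling.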

Interesting fact: I can now officially demonstrate that the PMID does not directly correlate with the publication date, e.g. two papers with consecutive PMIDs may have been published in completely different years.
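One way to check this on a PMID/year table is to sort by PMID and look for neighbours whose year goes backwards. The sketch below uses made-up records and a hypothetical helper name, `out_of_order_neighbours`; it is not the R script mentioned above.

```python
def out_of_order_neighbours(records):
    """Return adjacent pairs (after sorting by PMID) where the higher PMID has an earlier year."""
    ordered = sorted(records)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later[1] < earlier[1]
    ]


# Made-up (pmid, year) rows for illustration only.
records = [(100, 1999), (101, 2005), (102, 1998), (103, 2005)]
for earlier, later in out_of_order_neighbours(records):
    print(earlier, later)
```

Any pair reported here is a counterexample to the "PMIDs grow with publication year" assumption.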

7.3 years ago
1769mkc ★ 1.2k

I compiled and assembled this script; I hope it helps in some way.

library(RISmed)
library(rentrez)
library(XML)

search_topic <- ' ' # specify your query
# Search PubMed; adjust the date range as needed.
search_query <- EUtilsSummary(search_topic, retmax = 100, mindate = 2010, maxdate = 2016)

# PubMed IDs returned by the search.
your.ids <- QueryId(search_query)

# rentrez function to get the data from the pubmed db
fetch.pubmed <- entrez_fetch(db = "pubmed", id = your.ids,
                             rettype = "xml", parsed = TRUE)

# Extract the abstracts for the respective IDs.
abstracts <- xpathApply(fetch.pubmed, '//PubmedArticle//Article',
                        function(x) xmlValue(xmlChildren(x)$Abstract))

# Name each abstract by its ID.
names(abstracts) <- your.ids
abstracts

# Collect the abstracts into a data frame, one row per ID.
col.abstracts <- do.call(rbind.data.frame, abstracts)
dim(col.abstracts)
write.csv(col.abstracts, file = "abs.csv")
7.3 years ago
bongok ▴ 40

Have you tried E-utilities? https://www.ncbi.nlm.nih.gov/books/NBK25497/

Fetch the records for your PubMed IDs in XML format and write a script to parse out the date. https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch
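As a rough sketch of that suggestion, the Python below builds EFetch URLs for batches of PMIDs and parses the year out of a response. The helper names and the batch size of 200 are assumptions made for illustration, and the code does not send any requests itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_urls(pmids, batch_size=200):
    """Build one EFetch URL per batch of PMIDs (batch_size is an arbitrary choice)."""
    urls = []
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start + batch_size]
        query = urlencode({"db": "pubmed", "retmode": "xml", "id": ",".join(map(str, batch))})
        urls.append(f"{EFETCH_URL}?{query}")
    return urls


def years_from_efetch_xml(xml_text):
    """Map PMID -> PubDate year (None if absent) for every PubmedArticle in an EFetch response."""
    root = ET.fromstring(xml_text)
    return {
        article.findtext("MedlineCitation/PMID"): article.findtext(
            "MedlineCitation/Article/Journal/JournalIssue/PubDate/Year"
        )
        for article in root.iter("PubmedArticle")
    }


urls = efetch_urls(list(range(1, 451)))
print(len(urls))
```

Each URL could then be fetched with `urllib.request.urlopen` and the response body passed to `years_from_efetch_xml`. NCBI asks clients to throttle their requests, so a pause between batches would be needed.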


That would be a possibility, but I need to do it for basically all the papers ever published! It would be a bit overkill, especially considering that with a SQL query I could simply calculate the max and min PMID per year. How would you structure such a query with the E-utilities?
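For illustration, the min/max query mentioned here could look like the following, run through Python's built-in `sqlite3` module. The table name `articles`, its two columns, and the sample rows are all hypothetical; the thread does not say which database or schema would hold the data.

```python
import sqlite3


def pmid_range_per_year(rows):
    """Return (year, min_pmid, max_pmid) tuples, ordered by year, for (pmid, year) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (pmid INTEGER PRIMARY KEY, year INTEGER)")
    conn.executemany("INSERT INTO articles VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT year, MIN(pmid), MAX(pmid) FROM articles GROUP BY year ORDER BY year"
    ).fetchall()
    conn.close()
    return result


# Made-up rows for illustration only.
rows = [(100, 1999), (101, 2005), (102, 1998), (103, 2005), (104, 1999)]
for year, lo, hi in pmid_range_per_year(rows):
    print(year, lo, hi)
```

With these made-up rows the per-year ranges overlap, which is exactly the situation where an interval lookup would give wrong answers.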

7.3 years ago

It would require many requests to Entrez, but you could probably do this with Biopython (not exactly a 'smart' way, but it would work).
