Question

Batch Download from PubMed ftp server

8.6 years ago

pacman ▴ 70

I have a list of a couple hundred articles whose full text I would like to download from PubMed Central as PDFs. Unfortunately, even though all of them can be found in PubMed Central, fewer than 10% of them are available from the FTP service, based on file_list.pdf.txt. Is there another way to do a batch download, or at least to speed up the process? Thanks!

ncbi pubmed • 8.7k views

Updated 19 months ago by Ram 43k • written 8.6 years ago by pacman ▴ 70

This is a duplicate question.

How To Download A Few Hundred Pdfs From Pubmed

8.5 years ago by Tky ★ 1.0k

Ram · Answer 1 · 2015-10-01

8.6 years ago

Shicheng Guo ★ 9.4k

I have tried it, and it works. I can download individual files or a whole directory from the FTP server.

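Download a single file (wget saves into the current directory by default, so no trailing ./ is needed; wget would treat it as a second, invalid URL):

wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/08/e0/BCR-3-1-055.PMC13900.pdf

Download a whole directory:

wget -r ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/08/e0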
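
For a list of articles, the single-file command above can be driven from file_list.pdf.txt. The following is only a sketch: it assumes the file list is tab-separated, with the path relative to /pub/pmc/ in the first column and the PMC ID in the third, which should be checked against the actual file. The function name pmc_pdf_urls and the file names my_ids.txt and urls.txt are made up for illustration.

```shell
#!/bin/sh
# Sketch: turn a list of PMC IDs into FTP URLs using file_list.pdf.txt.
# ASSUMPTION: the file list is tab-separated, column 1 is the path below
# /pub/pmc/ and column 3 is the PMC ID (e.g. PMC13900). Verify this first.

BASE_URL="ftp://ftp.ncbi.nlm.nih.gov/pub/pmc"

# pmc_pdf_urls IDS_FILE FILE_LIST
#   IDS_FILE  : one PMC ID per line (hypothetical input file, must not be empty)
#   FILE_LIST : a local copy of file_list.pdf.txt
# Prints one URL per requested ID that appears in the file list.
pmc_pdf_urls() {
    awk -F '\t' -v base="$BASE_URL" '
        # First file: remember every non-empty ID that was requested.
        NR == FNR { if ($1 != "") wanted[$1] = 1; next }
        # Second file: print the URL for each row whose ID was requested.
        ($3 in wanted) { print base "/" $1 }
    ' "$1" "$2"
}

# Possible usage (commented out so that nothing is downloaded by accident):
#   pmc_pdf_urls my_ids.txt file_list.pdf.txt > urls.txt
#   wget -i urls.txt
```

IDs that are missing from the file list simply produce no URL, so comparing the line counts of my_ids.txt and urls.txt shows how many articles are not available over FTP.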

Updated 4.4 years ago by Ram 43k • written 8.6 years ago by Shicheng Guo ★ 9.4k

Yes, it works, but not all PMC articles can be found on the FTP server due to copyright issues :(

Updated 19 months ago by Ram 43k • written 8.5 years ago by pacman ▴ 70

Which ones? Do you have an example?

8.5 years ago by Shicheng Guo ★ 9.4k

Ram · Answer 2 · 2015-10-01

Hi,

I found a trick that makes the process go a little bit faster.

1. Find the corresponding PMC ID (you can find it in PMC-ids.csv.gz).
2. Follow the format of this URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMCXXXXXX/pdf -> this URL will redirect the browser to the download link if the PDF is available.

I use an Excel VBA script to semi-automate the process by opening a browser and directing it to the generated URL. This process is not ideal, but at least it minimizes the number of clicks needed to download the articles.
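
The URL-building step could also be scripted in the shell instead of Excel VBA. This is a rough sketch, not the script described above: the function name pmc_article_pdf_urls and the file my_ids.txt are hypothetical, and the code only prints URLs in the format given above. Whether the redirect also works for non-browser clients has not been checked here.

```shell
#!/bin/sh
# Sketch: build article PDF URLs from PMC IDs read on standard input.
# Accepts IDs with or without the "PMC" prefix, one per line.
pmc_article_pdf_urls() {
    while IFS= read -r id; do
        # Skip blank lines.
        [ -n "$id" ] || continue
        # Add the PMC prefix when only the number was given.
        case $id in
            PMC*) ;;
            *) id="PMC$id" ;;
        esac
        printf 'http://www.ncbi.nlm.nih.gov/pmc/articles/%s/pdf\n' "$id"
    done
}

# Possible usage (commented out so that nothing is fetched by accident):
#   pmc_article_pdf_urls < my_ids.txt > article_urls.txt
```

The resulting list can then be opened in a browser or handed to whichever download tool turns out to follow the redirect.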

Ram · Answer 3 · 2015-10-22

I'd say the easiest way to download all the documents is to use FileZilla. Host: ftp.ncbi.nlm.nih.gov, directory: /pub/pmc

There you can download them (those from PMC, i.e. the articles from your list; about 1.1 million) in various formats. The root of that directory includes a couple of tar.gz files: articles-* has all of the roughly 1 million articles in nxml (an XML format), and articles.txt.* has them in plain text. The latter seemed great at first, but I found some issues with outlining.

The other files, 'file_list*', list the journal issues, including the PMC ID and the location where more information can be downloaded (a link to ftp.ncbi.nlm.nih.gov/pub/pmc/{hex}/{hex}/{article}). This includes PDFs, nxml files and images...