How to download latest human AGP chromosome files (no alts, MT, unplaced, unlocalized) from NCBI?
5.8 years ago
eroberts • 0

Hello,

Currently I'm trying to figure out how to download the latest patch of the hg38 assembly in AGP format from NCBI.

I have some oddly stringent requirements that make this process difficult:

  1. I need to make the command as short and as simple as possible for other users to use.
  2. It should be future-proof and get the latest patch number. However, I'd settle for a permanent location for older patch numbers.
  3. In the FTP link above, there are AGP files that are not necessary for our users to obtain. I only need the AGP files for chr1-22, X and Y.

I've tried various combinations of wget's recursive and accept-regex options, but they seem to be almost ignored. I believe this is because the target is an FTP site, so wget does not fetch a proper HTML file to parse. You can "glob" on FTP sites with wget: it fetches the ".listing" file and matches the glob pattern against the names in it. However, I cannot find a pattern that matches only the AGP files I'm interested in.

Any insight or best guesses would be greatly appreciated.

Thanks!

assembly bash agp human hg38 • 1.8k views
5.8 years ago
GenoMax 141k

Use the following command to get just the AGP files. Since the link below will always contain the latest files, it should be reasonably foolproof:

wget -r -nd -H --reject "index.*" --accept "*chr*.agp.gz" --reject "*chrMT*" ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/Assembled_chromosomes/agp/

Unfortunately, this retrieves the alt, MT, unplaced and unlocalized AGP files as well, which doesn't meet my 3rd requirement. There is always the option of doing this in multiple commands by removing the unnecessary chromosomes afterwards, but again, I'm trying to keep this as short and simple as possible.


No, it does not. I am not sure what wget you are using, but on my system I only get chr* files. If you don't need chrMT, I have amended the command above to reject that file.
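For anyone who wants to check which names a stricter pattern would keep, here is a minimal sketch of a chr1-22/X/Y filter written as shell `case` patterns. The `<prefix>_chr<N>.agp.gz` naming and the `example_*` file names are assumptions for illustration only, not taken from the FTP listing, and `keep_primary_agp` is a hypothetical helper name.

```shell
# Print only the names that look like chr1-22, chrX or chrY AGP files.
# ASSUMPTION: files are named "<prefix>_chr<N>.agp.gz"; adjust if they are not.
keep_primary_agp() {
    while IFS= read -r name; do
        case "$name" in
            *_chr[1-9].agp.gz|*_chr1[0-9].agp.gz|*_chr2[0-2].agp.gz|*_chr[XY].agp.gz)
                printf '%s\n' "$name" ;;
        esac
    done
}

# Hypothetical file names, for illustration only.
printf '%s\n' \
    example_chr1.agp.gz example_chr22.agp.gz example_chrX.agp.gz \
    example_chrMT.agp.gz example_unplaced.agp.gz |
    keep_primary_agp
```

Note that the leading `*` would also let through any alt-assembly file that ends in the same `_chr<N>.agp.gz` suffix, so in practice the prefix would probably need to be pinned as well.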


My mistake; yes, I meant that it retrieves only the MT file as well.


I've thought about using the parallel option, but if I went that route I'd still have to worry about the patch number, which may change.

The 2nd wget command also still runs into the issue of not being patch-number agnostic. You can also put the bash sequences into a single "{}", like so: "{{1..22},{X,Y}}", which is what I had previously, except for the patch number issue I just mentioned. I'm also not a big fan of using bash this way since it runs multiple wget commands, which, for FTP access, means downloading the .listing file for each chromosome (not the worst outcome, just not preferred).
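As a rough sketch of the per-chromosome approach with only one wget process, the URLs can be generated first and then piped to wget on stdin with `-i -`. This uses a plain loop rather than brace expansion so that it also works in non-bash shells. The `agp_urls` helper, the base URL and the file prefix are all hypothetical, and the prefix (including the patch number) still has to be supplied by hand, so this does not solve the patch-number problem.

```shell
# Print one URL per chromosome (1-22, X, Y) for a directory URL and a file prefix.
# ASSUMPTION: files are named "<prefix>_chr<N>.agp.gz" under the given directory.
agp_urls() {
    base=$1
    prefix=$2
    i=1
    while [ "$i" -le 22 ]; do
        printf '%s/%s_chr%s.agp.gz\n' "$base" "$prefix" "$i"
        i=$((i + 1))
    done
    for c in X Y; do
        printf '%s/%s_chr%s.agp.gz\n' "$base" "$prefix" "$c"
    done
}

# Hypothetical usage: a single wget process reads all 24 URLs from stdin.
# agp_urls "ftp://example.org/agp" "example_prefix" | wget -nd -i -
```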


Thanks for the details. Got your point.
