Locate positions in a large bed file
2
0
Entering edit mode
6.5 years ago
DVA ▴ 630

I have a ~20G bed file for a few samples, and the bed file contains read depth for all positions of the human genome. Also, I have a list of chromosome positions (~30,000 to 50,000) for each sample, and I would like to know the read depth at these positions. What would be the most efficient method please?

I'm currently thinking to split my large bed file and the list per chromosome or block (via grep or BEDOPS), and work from there, but I would love to know better ways to do this. Thank you very much.

bed • 2.2k views
ADD COMMENT
2
Entering edit mode
6.5 years ago

You don't need to split the file.

Sort your BED file with BEDOPS sort-bed, if unsorted:

$ sort-bed reads.unsorted.bed > reads.bed

Then pass the chromosome name to bedops or bedmap or closest-features via --chrom <chromosome> to do work only on that chromosome:

$ bedops --chrom chrN --operations ...

Or:

$ bedmap --chrom chrN --operations ...

Etc.

If you want a list of all operations, take a look at the BEDOPS documentation.

If you want a fast list of chromosomes:

$ bedextract --list-chr reads.bed

You can pipe this list to a script loop, to do work over each chromosome on a computational cluster. BEDOPS enables parallelization by chromosome pretty easily.

These operations are very fast, because this uses the sort order in sort-bed sorted BED files to do a binary search that determines chromosome bounds. Operations that take minutes or hours can go down to seconds.

If your reads are single-end, you should be able to add --faster to BEDOPS commands to make them use similar techniques to speed operations up even further.

Paired-end reads can result in "nested" elements that prevent the use of --faster without some pre-processing tricks. Your data is on the multi-GB scale such that, if you are working with paired-end reads, you may be interested in investigating these tricks if you're doing these operations repeatedly. See the "extra work" note in the bedextract documentation.

ADD COMMENT
0
Entering edit mode

Thank you very much for the detailed information. The list of positions I'm talking about is no longer at reads level - they are just chromosome & positions. It looks like I should turn this list into another bed file, and just use "6.1.2.3.3. Retrieving elements which overlap target elements" and I'm going to give it a try.

ADD REPLY
1
Entering edit mode
6.5 years ago
GenoMax 141k

BEDOPS should have a function equivalent to intesectbed from Bedtools.

ADD COMMENT
0
Entering edit mode

Thank you so much for your reply. I am looking at this too - do you happen to know how fast these two functions are for large files?

ADD REPLY

Login before adding your answer.

Traffic: 1923 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6