Question

comparing 2 long BED files in R in an efficient way

5.8 years ago

Bogdan ★ 1.4k

Dear all,

Could you please suggest efficient R code for the following problem:

-- given a BED file with variant coordinates (> 10,000 entries); call it file A

-- and a BED file with gene coordinates (> 50,000 genes); call it file B

What is the most efficient way in R to compare file A with file B and append the gene from file B to each entry in file A whenever their genomic coordinates intersect? Thanks a lot!

VCF BED • 2.5k views

updated 5.8 years ago by Devon Ryan 104k • written 5.8 years ago by Bogdan ★ 1.4k

I would recommend bedtools or BEDOPS for this task instead of R. Within R, one can use a library such as bedr.

5.8 years ago by cpad0112 21k

score 0 · Answer 1 · 2018-07-02

Via BEDOPS bedmap:

$ bedmap --echo --echo-map B.bed A.bed > answer.bed

The file answer.bed will contain each element from B.bed (each gene) and — adjacent to the gene — a list of variants from A.bed that overlap that gene. Add the --skip-unmapped option to remove genes from B.bed that do not have overlaps with variants from A.bed.

To run this with R, use bedr or system:

1. https://cran.r-project.org/web/packages/bedr/index.html

2. http://stat.ethz.ch/R-manual/R-devel/library/base/html/system.html
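As a rough sketch of the system route, the bedmap call above could be wrapped in base R along these lines. The helper name bedmap_args is hypothetical, and the file names are the ones used in the command above. The call is guarded so that it only runs when bedmap is on the PATH and both files exist. Note that BEDOPS tools expect sorted input (for example via sort-bed).

```r
# Hypothetical helper (not part of BEDOPS or bedr): assembles the same
# arguments as the bedmap command shown in this answer.
bedmap_args <- function(ref, map, skip_unmapped = FALSE) {
  c("--echo", "--echo-map",
    if (skip_unmapped) "--skip-unmapped",  # drop genes without any variant
    ref, map)
}

args <- bedmap_args("B.bed", "A.bed", skip_unmapped = TRUE)

# Run bedmap only if it is installed and both (sorted) BED files are present;
# stdout is redirected to answer.bed, as in the shell command.
if (nzchar(Sys.which("bedmap")) && all(file.exists(c("B.bed", "A.bed")))) {
  system2("bedmap", args, stdout = "answer.bed")
}
```

The resulting answer.bed can then be read back into R with read.table or a similar function.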

score 0 · Answer 2 · 2018-07-02

5.8 years ago

Devon Ryan 104k

In addition to the suggestion from cpad0112 (note that bedtools/bedops in particular are probably faster than this) you can use the GenomicRanges package:

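library(GenomicRanges)
# a and b are assumed to be GRanges objects built from files A and B
# append to a the ranges in b (genes) that overlap any range in a (variants)
a = c(a, subsetByOverlaps(b, a))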

Again, I expect bedtools or bedops are a bit quicker, but you don't have that many regions, so it shouldn't much matter either way.
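If the goal is to attach the gene name to each variant row, as the question asks, one possible variation on the GenomicRanges approach uses findOverlaps instead of subsetByOverlaps. This is only a sketch: the small data frames below are made-up stand-ins for files A and B, and the column names (chr, start, end, name, gene) are assumptions, not something taken from the original files.

```r
library(GenomicRanges)

# Toy stand-ins for A.bed (variants) and B.bed (genes); values are invented.
variants <- data.frame(chr   = c("chr1", "chr1", "chr2"),
                       start = c(100, 500, 50),
                       end   = c(101, 501, 51))
genes <- data.frame(chr   = c("chr1", "chr1", "chr2"),
                    start = c(0, 90, 1000),
                    end   = c(200, 150, 2000),
                    name  = c("geneA", "geneB", "geneC"))

# BED starts are 0-based and half-open; GRanges is 1-based and closed,
# so add 1 to the start when converting.
a <- GRanges(variants$chr, IRanges(variants$start + 1, variants$end))
b <- GRanges(genes$chr, IRanges(genes$start + 1, genes$end), gene = genes$name)

# One row per (variant, gene) overlap
hits <- findOverlaps(a, b)

# Collapse all overlapping gene names into one string per variant
gene_by_variant <- tapply(b$gene[subjectHits(hits)], queryHits(hits),
                          function(x) paste(sort(x), collapse = ","))

# Variants without any overlapping gene keep NA
variants$gene <- NA_character_
variants$gene[as.integer(names(gene_by_variant))] <- as.vector(gene_by_variant)
```

With real data, the two data frames would come from reading the BED files (for example with read.table), and variants could then be written back out with the extra gene column.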