comparing 2 long BED files in R in an efficient way
5.8 years ago
Bogdan ★ 1.4k

Dear all,

could you please suggest efficient R code for the following problem:

-- given a BED file with the variant coordinates (> 10 000 entries) : let's call the file A

-- and a BED file with the gene coordinates (>50 000 genes) : let's call the file B

What is the most efficient way in R to compare file A with file B and, wherever the genomic coordinates intersect, append the gene from file B to the matching entry in file A? Thanks a lot!

VCF BED • 2.5k views

I would recommend bedtools or BEDOPS for this task instead of R. Within R, one can use a library such as bedr.

5.8 years ago

Via BEDOPS bedmap:

$ bedmap --echo --echo-map B.bed A.bed > answer.bed

The file answer.bed will contain each element from B.bed (each gene) and — adjacent to the gene — a list of variants from A.bed that overlap that gene. Add the --skip-unmapped option to remove genes from B.bed that do not have overlaps with variants from A.bed.

To run this with R, use bedr or system:

1. https://cran.r-project.org/web/packages/bedr/index.html

2. http://stat.ethz.ch/R-manual/R-devel/library/base/html/system.html
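As a rough sketch of the system route, the bedmap call above could be wrapped with base R's system2(). The function name run_bedmap and the file names are placeholders of mine, not part of BEDOPS, and this assumes bedmap is on the PATH and that both BED files have already been sorted with sort-bed, which BEDOPS tools expect.

```r
# Hypothetical helper: runs BEDOPS bedmap from R and writes the result to out_bed.
# Assumes both inputs were sorted with sort-bed beforehand.
run_bedmap <- function(genes_bed, variants_bed, out_bed) {
  # Fail early if the bedmap binary cannot be found
  if (Sys.which("bedmap") == "") {
    stop("bedmap not found on PATH; install BEDOPS first")
  }
  # Same options as the shell command above; file paths are quoted for the shell
  args <- c("--echo", "--echo-map", shQuote(genes_bed), shQuote(variants_bed))
  status <- system2("bedmap", args = args, stdout = out_bed)
  if (status != 0) {
    stop("bedmap exited with status ", status)
  }
  invisible(out_bed)
}

# Example call (placeholder file names):
# run_bedmap("B.bed", "A.bed", "answer.bed")
```

The output file can then be read back into R with readLines() or read.delim(), depending on how you want to split the gene and variant parts.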

5.8 years ago

In addition to the suggestion from cpad0112 (and note that bedtools/BEDOPS are probably faster than this), you can use the GenomicRanges package:

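library(GenomicRanges)
# a and b are GRanges objects built from file A (variants) and file B (genes)
# subsetByOverlaps(b, a) keeps the genes in b that overlap at least one range in a
a = c(a, subsetByOverlaps(b, a))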

Again, I expect bedtools or BEDOPS to be a bit quicker, but you don't have that many regions, so it shouldn't matter much either way.
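If the goal is a gene column on each variant row, as the question asks, one possible sketch uses findOverlaps() from the same package. The toy data frames and the column names (chrom, start, end, name) are assumptions standing in for files A and B; real BED files would first be read with read.table() or similar.

```r
library(GenomicRanges)

# Toy stand-ins for file A (variants) and file B (genes); BED starts are 0-based
variants <- data.frame(chrom = c("chr1", "chr1", "chr2"),
                       start = c(100, 500, 50),
                       end   = c(101, 501, 51))
genes <- data.frame(chrom = c("chr1", "chr2"),
                    start = c(90, 40),
                    end   = c(200, 60),
                    name  = c("geneA", "geneB"))

# Convert to GRanges, shifting 0-based BED starts to 1-based coordinates
a <- makeGRangesFromDataFrame(variants, starts.in.df.are.0based = TRUE)
b <- makeGRangesFromDataFrame(genes, keep.extra.columns = TRUE,
                              starts.in.df.are.0based = TRUE)

# One row per (variant, gene) overlapping pair
hits <- findOverlaps(a, b)

# Collapse all gene names hitting the same variant into one comma-separated string
gene_by_variant <- tapply(b$name[subjectHits(hits)], queryHits(hits),
                          paste, collapse = ",")

# Variants without any overlapping gene keep NA
variants$gene <- NA_character_
variants$gene[as.integer(names(gene_by_variant))] <- gene_by_variant
```

With the toy data, the first and third variants fall inside geneA and geneB respectively, and the second variant has no overlapping gene.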


Thank you for your comments and help.
