Question

Comparing methylation data - Data cleaning and efficient code question

8.1 years ago

startup_biostar ▴ 20

I have a dataset, data_1, in plain-text (txt) format with the columns chr (chromosome number), stable_id, start, end and methylation. Its coordinates are on the mm9 assembly.

I have a second dataset, data_2, in bigWig format with the columns seqnames, ranges, strand and methylation score. Its coordinates are on the mm10 assembly, and it has over 10 million rows.

I need to compare data_1$start and data_1$end with data_2$ranges and compute the average methylation score and the number of CpG islands.

These are the steps I followed, which I believe are the long route.

Step 1: Converted data_1 to the 'chrN:start-end' format and exported it as a CSV file.
Step 2: Uploaded this CSV file to the UCSC Genome Browser LiftOver tool and converted it from mm9 to mm10. The output was a BED file.
Step 3: Replaced the start and end columns of data_1 with the new start and end coordinates from the lifted-over BED file.
Step 4: Compared the start and end of data_1 with data_2. This is where I am stuck: it takes a long time to process in R. Is there a simpler way than the one I followed?

I am new to the field, so please explain in steps.

genome sequencing R • 1.8k views

score 3 · Accepted Answer · 2016-03-09

Welcome to Biostars.
Please see my answer: A: findOverlaps function in R
That answer uses foverlaps from the data.table package, which is fast and should give you what you want. If you still have problems, please edit your question and we will help.
Basically you want to:

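library(data.table)  
# Both tables must be data.tables with chr, start and end columns (rename data_2's columns if needed)  
setkey(data_1, chr, start, end)  
setkey(data_2, chr, start, end)  
foverlaps(data_1, data_2)  # one row per overlapping pair of intervals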
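
To get from the overlap join to the numbers asked for in the question, the joined rows can be grouped by region. The following is only a minimal sketch on made-up toy data: the table names regions and cpgs, the column name score and all values are assumptions for illustration, with regions standing in for data_1 after liftOver to mm10 and cpgs standing in for data_2 already converted to a data.table with chr, start and end columns.

```r
library(data.table)

# Toy stand-ins for the real tables (all values are invented for illustration).
# regions: plays the role of data_1 after liftOver to mm10.
regions <- data.table(
  chr       = c("chr1", "chr1", "chr2"),
  start     = c(100L, 500L, 100L),
  end       = c(200L, 600L, 300L),
  stable_id = c("r1", "r2", "r3")
)
# cpgs: plays the role of data_2, one row per scored position.
cpgs <- data.table(
  chr   = c("chr1", "chr1", "chr1", "chr2"),
  start = c(120L, 150L, 700L, 250L),
  end   = c(120L, 150L, 700L, 250L),
  score = c(0.2, 0.6, 0.9, 0.5)
)

# Key the table that is looked up into; the last two key columns
# are treated as the interval.
setkey(regions, chr, start, end)

# Overlap join: keep each scored position that lies within a region.
# nomatch = 0L drops positions that fall in no region.
hits <- foverlaps(cpgs, regions,
                  by.x = c("chr", "start", "end"),
                  type = "within", nomatch = 0L)

# Per region: average score and number of overlapping positions.
summary_dt <- hits[, .(mean_score = mean(score), n_cpg = .N), by = stable_id]
print(summary_dt)
```

Regions that contain no scored position (r2 in the toy data) do not appear in summary_dt. If they should be kept, one option is to merge summary_dt back onto regions by stable_id.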

Just a friendly suggestion: don't name objects like data_1; use data1 instead. See Google's R Style Guide.