I'm just a beginner in this area so I apologize first for wasting your time with this stupid question:
I recently downloaded some data from Uniprot and want to do some domain coverage analysis, but just realized that there are some overlapping ranges that needs to be merged.
For example, my data looks like this:
no. pfamseq_acc seq_version crc64 md5 pfamA_acc seq_start seq_end
1857719 I1HYZ2 1 E9A7BABADB78EF7A 217559545415239f73152e983a1e4226 PF00083 11 244
1857720 I1HYZ2 1 E9A7BABADB78EF7A 217559545415239f73152e983a1e4226 PF00083 482 737
1857722 A8I2W9 1 125E4A38BE3390D8 3be8943f5916ee0fd2c80adba0af3377 PF00083 83 299
1857723 A8I2W9 1 125E4A38BE3390D8 3be8943f5916ee0fd2c80adba0af3377 PF00083 286 500
1857724 A0A0L0MB57 1 277707E4DDF29006 cf70dbe1ad2c63df7e4b6d419c5dff05 PF00083 10 222
1857725 A0A0L0MB57 1 277707E4DDF29006 cf70dbe1ad2c63df7e4b6d419c5dff05 PF00083 223 423
And what I want is:
no. pfamseq_acc seq_version crc64 md5 pfamA_acc seq_start seq_end
1857719 I1HYZ2 1 E9A7BABADB78EF7A 217559545415239f73152e983a1e4226 PF00083 11 244
1857720 I1HYZ2 1 E9A7BABADB78EF7A 217559545415239f73152e983a1e4226 PF00083 482 737
1857722 A8I2W9 1 125E4A38BE3390D8 3be8943f5916ee0fd2c80adba0af3377 PF00083 83 500
1857724 A0A0L0MB57 1 277707E4DDF29006 cf70dbe1ad2c63df7e4b6d419c5dff05 PF00083 10 423
I have tried to write it in R which looks like:
library(dplyr)
setwd("C:/Users/hy/Documents/Desktop/test2")
d<- read.csv("test2.tsv", header = T, sep = "\t", dec = ".")
d %>%
group_by(md5) %>%
arrange(md5,seq_start) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(seq_start)) >
cummax(as.numeric(seq_end)))[-n()])) %>%
group_by(md5,pfamseq_acc,pfamA_acc)
summairse(start = min(seq_start), end = max(seq_end))
The "summarise" could give me the correct result, but I failed to write it in a new file.
So I would appreciate it if anyone could give some ideas about how to achieve that (preferably by R or Python), any other methods is also welcomed.
Thanks.
Thanks for your reply. However my data is in a tsv file. I have tried to do that with "merge" function in BedTools, but it would give error if I convert my file to a BED one.
Just fyi and for the future, "Give error" is not a helpful error message. Please be more precise so that people can reproduce the problem.
Try BEDOPS, maybe, with the example given. If you need to, use
awk
to reorder columns into a minimally three-column file, where the first three columns are: chromosome, start, and stop position.