Biostar Beta. Not for public use.
Cluster the intervals but keep strand and return the average score
16 months ago
ifreecell • 170

Hi, I have a file in the following format:

chr1 33711 33712 + 3.29
chr1 33712 33713 + 3.31
chr1 33713 33714 + 3.33
chr1 33714 33715 + 3.34
chr1 33715 33716 + 3.33
chr1 33716 33717 + 3.32

This format is not very compact, so I want to collapse adjacent intervals into something like:

chr1 33711 33717 + 3.32

I tried clustering the intervals using Galaxy, but it returned only the first three columns:

chr1 33711 33716

I really need to keep the strand and score columns, because later I will sort the file by score. Is there a script or tool that can do this? Ideally it would have an option to report the mean score of the merged range in the fifth column.

Here is a sample file to test with.

Bed wig • 1.2k views
11 months ago
Freiburg, Germany

Here's an R solution (I've made it a bit longer than needed so it's easier to follow), though you could just as well iterate over the lines in Python or Perl.

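library(rtracklayer)   # import.bed()
library(GenomicRanges) # reduce(), findOverlaps()
foo <- import.bed("S2_RF25_2.54._score.bed")
foo2 <- reduce(foo) # Merge neighboring positions while noting strand
o <- findOverlaps(foo, foo2) # Map each original interval to its merged range
scores <- split(foo$score, subjectHits(o)) # Group the scores by merged range
foo2$scores <- unlist(lapply(scores, mean)) # Mean score per merged range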

foo2 can then be exported. You can adjust the min.gapwidth argument of reduce() if you want to merge intervals separated by a larger gap.
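For anyone who prefers the "just iterate over it" route, here is a minimal Python sketch of the same idea. It assumes the whitespace-separated five-column layout shown in the question (chromosome, start, end, strand, score). The names `parse_line`, `merge_intervals` and `max_gap` are made up for this example and do not come from any library. As in the R code above, the reported score is the plain mean of the input rows that fall into each merged range.

```python
from collections import namedtuple

# Hypothetical record type for one input row (layout assumed from the question).
Interval = namedtuple("Interval", "chrom start end strand score")


def parse_line(line):
    """Parse 'chrom start end strand score' into an Interval."""
    chrom, start, end, strand, score = line.split()
    return Interval(chrom, int(start), int(end), strand, float(score))


def merge_intervals(intervals, max_gap=0):
    """Merge intervals on the same chromosome and strand whose gap is
    at most max_gap bases, and report the mean score of each group."""
    merged = []
    current = None  # [chrom, start, end, strand, list_of_scores]
    ordered = sorted(intervals, key=lambda iv: (iv.chrom, iv.strand, iv.start))
    for iv in ordered:
        same_group = (
            current is not None
            and iv.chrom == current[0]
            and iv.strand == current[3]
            and iv.start - current[2] <= max_gap
        )
        if same_group:
            # Extend the open range and remember the score.
            current[2] = max(current[2], iv.end)
            current[4].append(iv.score)
        else:
            if current is not None:
                merged.append(_close(current))
            current = [iv.chrom, iv.start, iv.end, iv.strand, [iv.score]]
    if current is not None:
        merged.append(_close(current))
    return merged


def _close(group):
    """Turn an open group into an Interval carrying the mean score."""
    chrom, start, end, strand, scores = group
    return Interval(chrom, start, end, strand, sum(scores) / len(scores))


# The six rows from the question.
SAMPLE = """\
chr1 33711 33712 + 3.29
chr1 33712 33713 + 3.31
chr1 33713 33714 + 3.33
chr1 33714 33715 + 3.34
chr1 33715 33716 + 3.33
chr1 33716 33717 + 3.32
"""

merged = merge_intervals(parse_line(line) for line in SAMPLE.splitlines())
for iv in merged:
    print(f"{iv.chrom} {iv.start} {iv.end} {iv.strand} {iv.score:.2f}")
# prints: chr1 33711 33717 + 3.32
```

With `max_gap=0` only overlapping or directly adjacent intervals are merged. Raising it plays a similar role to allowing a larger gap in reduce(). For a real file you would read the lines from disk instead of `SAMPLE`.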

