How to merge bedGraph records which are next to each other and have the same score?
6.7 years ago
James Ashmore ★ 3.4k

Is anyone aware of software that merges bedGraph records when the score is the same? I have a bedGraph file calculated at single-base-pair resolution, and I would like to decrease the file size by merging adjacent records that have the same score.

bedGraph • 3.4k views

awk?
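
A minimal sketch of what an awk-only approach might look like, assuming the bedGraph is already sorted by chromosome and start, has four whitespace-delimited columns, and contains no `track` or `browser` header lines. The function name `merge_bedgraph` is made up for illustration. It extends a run only when chromosome and score match and the previous end equals the current start, so records separated by a gap are left alone.

```shell
#!/usr/bin/env bash
# merge_bedgraph: hypothetical helper, not part of bedtools or BEDOPS.
# Merges consecutive records that share chromosome and score AND abut
# (previous end == current start). Reads files given as arguments, or stdin.
merge_bedgraph() {
  awk 'BEGIN { OFS = "\t" }
    # Same chromosome, same score, and directly adjacent: extend the open run.
    NR > 1 && $1 == chrom && $2 == end && $4 == score { end = $3; next }
    # Otherwise print the finished run (if there is one) ...
    NR > 1 { print chrom, start, end, score }
    # ... and open a new run from the current record.
    { chrom = $1; start = $2; end = $3; score = $4 }
    # Print the last open run, unless the input was empty.
    END { if (NR > 0) print chrom, start, end, score }
  ' "$@"
}
```

Usage would be along the lines of `merge_bedgraph in.bedGraph > merged.bedGraph`. It makes a single pass and holds only one record in memory, so it should cope with base-pair-resolution files.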

6.7 years ago

cat test.bdg

chrY    1   2   10
chrY    2   3   10
chrY    3   4   11
chrY    4   5   12
chrY    5   6   12
chrY    6   7   13
chrY    7   8   14
chrY    8   9   14
chrY    9   10  12


groupBy -i test.bdg -g 1,4 -c 2,3 -o min,max | awk -v OFS="\t" '{ print $1, $3, $4, $2 }'

output:

chrY    1   3   10
chrY    3   4   11
chrY    4   6   12
chrY    6   7   13
chrY    7   9   14
chrY    9   10  12

Bedtools groupBy
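
One caveat worth checking: groupBy groups consecutive lines by the key columns only and does not look at the coordinates, so if the bedGraph omits some regions (for example zero-coverage stretches), two same-score records on either side of a gap would be joined across it. A quick sanity check, sketched below under the assumption of a four-column file without header lines, is to compare the total number of covered bases before and after merging. `covered_bases` is a made-up helper name, and `merged.bdg` stands for wherever the merged output was written.

```shell
#!/usr/bin/env bash
# covered_bases: hypothetical helper that sums (end - start) over all records.
# Reads files given as arguments, or stdin.
covered_bases() {
  # %.0f rather than %d so genome-scale totals print correctly in any awk.
  awk '{ total += $3 - $2 } END { printf "%.0f\n", total }' "$@"
}

# Usage sketch: the two totals should be identical if no gap was bridged.
#   covered_bases test.bdg
#   covered_bases merged.bdg
```

If the merged total is larger than the original, some records were merged across a gap.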


Thank you for the reply; however, I think this will only work on small bedGraph files. I got the following error when I tried it on my base-pair-resolution bedGraph file:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
/home/s1437643/conda/bin/groupBy: line 2: 12111 Aborted                 ${0%/*}/bedtools groupby "$@"

Did it work for a small file and fail only for the large file?


Actually, it turned out to be a problem with my bedtools installation, specifically the groupBy command. I've compiled bedtools from the latest source code and it works perfectly. Thank you!

6.7 years ago

Via BEDOPS, bash and GNU core utilities:

$ SORTED_BEDGRAPH=in.bedGraph
$ while read -r score; do awk -v s="$score" '$4==s' "${SORTED_BEDGRAPH}" | bedops --merge - | awk -v s="$score" '{print $0"\t"s}'; done < <(cut -f4 "${SORTED_BEDGRAPH}" | sort -u) | sort-bed - > answer.bed

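This makes one pass over the file per distinct score, so it should perform decently and scale to large inputs as long as the number of distinct scores is modest.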

Use sort-bed if you first need to sort the bedGraph file, so that merging works correctly.

For others working with BED rather than bedGraph, the score is usually in the fifth column, so the two awk statements and the cut -f4 call would need adjusting.
