Question

Change list of consecutive genomic coordinates to intervals?

6.5 years ago

patrick.turko ▴ 10

I have a long list of genomic coordinates in the format chromosome:position. Most or all of these could be expressed as intervals, i.e. chr:start-end, because the bases are consecutive. I can think of several approaches in R, but each list is 50 million lines long and I have many of them. Is there a fast way to do this?

If the input data looks like this:

1:501
1:502
1:503
1:634
1:635
1:636
8:9982
8:9983
8:9984
8:9985
etc

I would like the output to look like this:

1:501-503
1:634-636
8:9982-9985
etc

The input data is in order, and each line is unique. Any ideas? I'm open to R/data.table/Bioconductor, command-line tools like BEDtools, Unix utilities like awk, or whatever. I would prefer to avoid Python as it's not present anywhere else in this workflow.

sequence R intervals • 1.6k views

updated 6.5 years ago by Charles Plessy ★ 2.9k • written 6.5 years ago by patrick.turko ▴ 10

score 3 · Answer 1 · 2017-10-26

6.5 years ago

Pierre Lindenbaum 161k

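Using bedtools merge (http://bedtools.readthedocs.io/en/latest/content/tools/merge.html): convert each position to a 1-bp BED interval, sort, merge adjacent intervals, then convert back to chr:start-end: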

awk -F ':' '{printf("%s\t%d\t%d\n",$1,$2,1+$2);}'  input.txt|\
sort -t $'\t' -k1,1 -k2,2n |\
bedtools merge -i - |\
awk '{printf("%s:%d-%d\n",$1,$2,$3 -1);}'


1:501-503
1:634-636
8:9982-9985
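
To see why the last awk step subtracts 1, it may help to look at what the first stage emits on its own. This is only a minimal sketch using the sample positions from the question; `to_bed` is a hypothetical helper name introduced here for illustration, not part of bedtools:

```shell
# Hypothetical helper: turn chr:pos lines into 1-bp, half-open BED intervals.
to_bed() {
  awk -F ':' '{printf("%s\t%d\t%d\n", $1, $2, $2 + 1)}'
}

printf '1:501\n1:502\n1:503\n' | to_bed
# 1	501	502
# 1	502	503
# 1	503	504
```

Each interval ends exactly where the next one begins, so bedtools merge (which by default also joins book-ended features) collapses them into 1 501 504, and subtracting 1 from the end gives back 1:501-503.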

Thank you, this is very clean.

reply written 6.5 years ago by patrick.turko ▴ 10

score 1 · Answer 2 · 2017-10-26

Hi Patrick, assuming your data is in a file called MyData.intervals, you can try these commands (a mixture of sed, awk, and paste; the results are saved into a new file called MyNewData.intervals). Note that this assumes every run contains at least two consecutive positions, since an isolated position would break the pairing of starts and ends:

sed 's/:/\t/g' MyData.intervals | awk '{chr[NR]=$1; bp[NR]=$2} END {for (i=1; i<=NR; i++) if(chr[i]==chr[i+1] && ((bp[i]-1)!=bp[i-1])) print chr[i]":"bp[i]; else if (chr[i]==chr[i-1] && ((bp[i]+1)!=bp[i+1])) {print bp[i]}}' | paste -s -d'-\n' > MyNewData.intervals
cat MyNewData.intervals
1:501-503
1:634-636
8:9982-9985

Kevin
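
For very large inputs, the same kind of run detection can probably be done in a single pass, without holding every line in memory. The following is a minimal sketch rather than a tested production script, assuming sorted, unique chr:pos input as described in the question; `collapse_runs` is a hypothetical helper name, and isolated positions are printed as chr:pos-pos:

```shell
# Hypothetical helper: collapse consecutive chr:pos lines into chr:start-end.
collapse_runs() {
  awk -F ':' '
    # Same chromosome and the next position: extend the current run.
    NR > 1 && $1 == chr && $2 == end + 1 { end = $2; next }
    # Otherwise close the previous run, if there is one.
    NR > 1 { print chr ":" start "-" end }
    # Open a new run at the current line.
    { chr = $1; start = $2; end = $2 }
    # Flush the last run at end of input.
    END { if (NR > 0) print chr ":" start "-" end }
  '
}

printf '1:501\n1:502\n1:503\n1:634\n1:635\n1:636\n8:9982\n8:9983\n8:9984\n8:9985\n' | collapse_runs
# 1:501-503
# 1:634-636
# 8:9982-9985
```

On a real file this would be called as `collapse_runs < MyData.intervals > MyNewData.intervals`. Because only the current run is kept in memory, memory use should stay flat regardless of input length.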

score 1 · Answer 3 · 2017-10-26

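With Bioconductor, build a GRanges object of width-1 ranges and call reduce(), which merges overlapping and adjacent ranges: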

> suppressPackageStartupMessages(library(GenomicRanges))
> reduce(
    GRanges(
      seqnames = c(1,1,1,1,1,1,8,8,8,8),
      ranges   = IRanges(
                   start = c(501,502,503,634,635,636,9982,9983,9984,9985),
                   width = 1)))
GRanges object with 3 ranges and 0 metadata columns:
      seqnames       ranges strand
         <Rle>    <IRanges>  <Rle>
  [1]        1 [ 501,  503]      *
  [2]        1 [ 634,  636]      *
  [3]        8 [9982, 9985]      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths