AWK solution

Question

Code golf: detecting homopolymers of length N in the (human) genome

3

Entering edit mode

4.9 years ago

WouterDeCoster 47k

How would you feel about a code golf? Give us your pretiest, shortest, quickest pieces of code! Especially languages that nobody else has posted are welcome...

Objective: create a bed file of all homopolymers (repeats of the same nucleotides) of minimally length N based on the (human or other) reference genome.

The input is a fasta file, for example the human genome

Expected output example:

chr1    11540   11546
chr1    14908   14913
chr1    15468   15473
chr1    16318   16323
chr1    16505   16511
chr1    19735   19741
chr1    20316   20321

A useful benchmark set would be the human chromosome 22. My python code (see below), searching for 5-mers or longer, finds 244503 hits. The first 10 lines are:

chr22   16050521    16050526
chr22   16050548    16050553
chr22   16050570    16050575
chr22   16050578    16050583
chr22   16050679    16050684
chr22   16050835    16050840
chr22   16050932    16050937
chr22   16051192    16051198
chr22   16051303    16051310
chr22   16051311    16051317

code golf fasta homopolymer repeat • 4.8k views

ADD COMMENT • link updated 2 days ago by Ram 43k • written 4.9 years ago by WouterDeCoster 47k

0

Entering edit mode

DUPLICATE ! DUPLICATE ! DUPLICATE ! How to extract all the simple repeats from the hg19 reference genome

:-D

ADD REPLY • link 4.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

We're waiting for your html solution Pierre

ADD REPLY • link 4.9 years ago by WouterDeCoster 47k

score 5 · Answer 1 · 2019-05-13

AWK solution

NB - this is a memory hog. It reads the entire FASTA file into memory. It could be designed to be more memory efficient, but works pretty quick on a cluster environment, and is also fine for smaller FASTA files on a laptop/desktop computer.

1,set desired `minimum` homopolymer length as shell variable

HOMOPOLYMER_LENGTH=10

2, determine homopolymers

This first 'flattens' the FASTA file and converts everything to upper case to avoid issues with soft- / repeat-masked sequence

awk '/^>/{if (NR==1) {print $0} else if (NR>1) {print "\n"$0}}; !/^>/ {printf toupper($0)}' GCA_000001405.15_GRCh38_no_alt_analysis_set.fna |\
  awk -v len="${HOMOPOLYMER_LENGTH}" -F '' '/^>/{gsub("^>", "", $0); chr=$0}; !/^>/ {base=$1; start=1; end=1; for (i=2; i<=NF+1; i++) {if (base==$(i)) {end=i} else if (base!=$(i)) {if (end-start>=len-1) {print chr"\t"start"\t"end"\t"base"\t"(end-start)+1}; start=i; end=i; base=$(i)}}}'

chr1    1   10000   N   10000
chr1    28589   28603   T   15
chr1    31720   31733   A   14
chr1    33450   33464   A   15
chr1    33531   33541   T   11
chr1    36352   36364   A   13
chr1    37241   37250   T   10
chr1    40640   40658   T   19
chr1    43797   43812   A   16
chr1    44174   44183   A   10
chr1    46403   46416   T   14
chr1    47319   47328   A   10
chr1    51865   51877   A   13
chr1    61351   61361   A   11
chr1    71176   71186   A   11
chr1    71847   71861   T   15
chr1    73843   73856   T   14
chr1    77175   77195   A   21
chr1    82134   82154   A   21
chr1    91201   91213   A   13

The extra 4th and 5th columns are the homopolymer base and its length, respectively.

Kevin

score 3 · Answer 2 · 2019-05-13

3

Entering edit mode

4.9 years ago

Devon Ryan 104k

This won't win any awards for brevity, but I'd already written it for other purposes. If it helps, the only bit of code it uses that I didn't write myself is argparse.

ADD COMMENT • link 4.9 years ago by Devon Ryan 104k

score 3 · Answer 3 · 2019-05-13

I have a solution part in Go (because it does it really efficiently), part in Bash (for the golfing):

package main
import "fmt"
import "index/suffixarray"
import "io/ioutil"
import "regexp"
import "os"

func main() {
     M := os.Args[1]
     r := "[aA]{"+M+",}|[cC]{"+M+",}|[gG]{"+M+",}|[tT]{"+M+",}"
     R, _ := regexp.Compile(r)
     b, _ :=  ioutil.ReadAll(os.Stdin)
     index := suffixarray.New(b)
     fmt.Println(index.FindAllIndex( R, -1))        
}

The Go standard library includes a suffix array, which does all the heavy lifting for us.

Then you can do this for each chromosome (maybe in parallel for the whole genome, why not.) The chromosome is input by the standard input, and 5 in this case is the minimum size (called M in the Go code.)

Using chr22:

$ time tail -n+2 chr22.fa | ./homopolymer 5 | sed -e 's|\] \[|\n|g' | tr -d "[" | tr -d "]"  | sed 's|^|chr22\t|g' > homopolymer.bed


$ head homopolymer.bed 
chr22   16371531 16371536
chr22   16371581 16371586
chr22   16371589 16371594
chr22   16371692 16371697
chr22   16371851 16371856
chr22   16371950 16371955
chr22   16372215 16372221
chr22   16372329 16372336
chr22   16372337 16372343
chr22   16372418 16372425

Timing:

real    1m11.996s
user    1m12.240s
sys     0m0.408s

score 3 · Answer 4 · 2019-05-14

3

Entering edit mode

4.9 years ago

Pierre Lindenbaum 161k

C solution:

$ time ./biostar379454 chr22.fa 5
(...)
real    0m2.439s
user    0m0.475s
sys 0m0.538s

ADD COMMENT • link 4.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I suspect this will be faster by not mmaping the file but instead reading in block sized chunks.

ADD REPLY • link 4.9 years ago by Devon Ryan 104k

Ram · Answer 5 · 2019-05-13

Takes in fasta file and a second parameter for homopolymer length.

It streams through each line to find homopolymers. Outputs chromosome, start, end, homopolymer base, length of homopolymer.

I tested it out on this fasta file:

>A
AGTCAAAA
GGGGTTTTCCCC
>B
AGTCCCCCTTTTAAAA
GGGGTTTTCCCC
AAAATTTT

And it outputs:

A       4       8       A       4
A       8       12      G       4
A       12      16      T       4
A       16      20      C       4
B       3       8       C       5
B       8       12      T       4
B       12      16      A       4
B       16      20      G       4
B       20      24      T       4
B       24      28      C       4
B       28      32      A       4
B       32      36      T       4

Here's the script

import sys

inFile = open(sys.argv[1],'r')
l = int(sys.argv[2])
chrome = base = start = end = -1

for line in inFile:
    if line[0] == '>':
        if end - start > l:
            print('\t'.join(map(str,[chrome,start,end,base,end - start])))
        chrome = line.strip().split()[0][1:]
        base = -1
        start = end = 0
    else:
        for b in line.strip():
            if b != base:
                if end - start > l:
                    print('\t'.join(map(str,[chrome,start,end,base,end - start])))
                start = end
                base = b
            end += 1

if end - start > l:
    print('\t'.join(map(str,[chrome,start,end,base,end - start])))

Ram · Answer 6 · 2019-05-13

I'll start with a Python snippet I had lying around, but can it be improved?

As input it takes just a fasta file (sys.argv[1]). This example searches for homopolymers of length 5 and up.

import sys
import re
from Bio import SeqIO

for nucl in ['A','C','T','G']:
    pattern = re.compile(nucl + "{5,}")
    for record in SeqIO.parse(sys.argv[1], "fasta"):
        for match in pattern.finditer(str(record.seq).upper()):
            print("\t".join([record.id, str(match.start()), str(match.end())]))

Ram · Answer 7 · 2019-05-13

Using R, there must be a Biostrings-way of doing this, I am just using rle:

library(Biostrings)

N = 10

# http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz
x <- readBStringSet("../chr22.fa")
xRLE <- rle(unlist(strsplit(toupper(toString(x)), "")))

pos <- cumsum(c(1, xRLE$lengths))
ix <- which(xRLE$lengths >= N & xRLE$values %in% c("A", "T", "C", "G"))

res <- data.frame(chrom = "chr22",
                  start = pos[ ix ],
                  end = pos[ ix ] + xRLE$lengths[ ix ] - 1)

head(res)
#   chrom    start      end
# 1 chr22 16069784 16069795
# 2 chr22 16072433 16072447
# 3 chr22 16082599 16082608
# 4 chr22 16085351 16085376
# 5 chr22 16085557 16085568
# 6 chr22 16096059 16096068

Let's test:

subseq(x, 16069784, 16069795)
# A BStringSet instance of length 1
# width seq                          names               
# [1]    12 aaaaaaaaaaaa                 chr22
subseq(x, 16072433, 16072447)
# A BStringSet instance of length 1
# width seq                          names               
# [1]    15 AAAAAAAAAAAAAAA              chr22

AWK solution

1,set desired minimum homopolymer length as shell variable

2, determine homopolymers

1,set desired `minimum` homopolymer length as shell variable