Question

Software to identify overrepresented k-mers in sequencing data

7.6 years ago

abascalfederico ★ 1.2k

Hi all,

I need to identify overrepresented k-mers in sequencing data. Ideally, I would need k-mers of lengths between 7 and 20 (I am searching for remnants of sequencing adaptors).

Does anyone know of a program able to do this?

Thanks! Federico

sequencing k-mer • 2.3k views

ADD COMMENT • link updated 7.6 years ago by Pierre Lindenbaum 161k • written 7.6 years ago by abascalfederico ★ 1.2k

score 2 · Accepted Answer · 2016-09-01

7.6 years ago

Pierre Lindenbaum 161k

FastQC with the -k/--kmers option: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/11%20Kmer%20Content.html

   -k --kmers       Specifies the length of Kmer to look for in the Kmer content
                    module. Specified Kmer length must be between 2 and 10. Default
                    length is 7 if not specified.

ADD COMMENT • link 7.6 years ago by Pierre Lindenbaum 161k

Thanks Pierre! That may be helpful, but I would like to be able to search for longer k-mers (up to 20 bp).
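
For longer k-mers, one option outside FastQC is to count only the k-mers that actually occur in the reads, so memory scales with the data rather than with every possible sequence. A minimal sketch, assuming the reads are already loaded as plain strings; the helper name `top_kmers` and the toy reads are hypothetical and not taken from any tool mentioned in this thread:

```python
from collections import Counter


def top_kmers(reads, k, n=5):
    """Return the n most frequent k-mers across reads (hypothetical helper)."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read; reads shorter than k
        # produce an empty range and contribute nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts.most_common(n)


# Toy reads (made up for illustration) that share one 12-mer.
reads = [
    "TTGCAAGATCGGAAGAG",
    "AGATCGGAAGAGCCTTA",
    "GGAGATCGGAAGAGTTC",
]
print(top_kmers(reads, k=12, n=1))
```

On real data the same idea would need a FASTQ parser and probably a sample of the reads, since the counter still grows with the number of distinct k-mers observed.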

ADD REPLY • link 7.6 years ago by abascalfederico ★ 1.2k

The fastqc launcher is a Perl wrapper script; change the following lines:

if ($kmer_size) {
    unless ($kmer_size =~ /^\d+$/) {
        die "Kmer size '$kmer_size' was not a number";
    }
    #### CHANGE 10 to WHATEVER...
    if ($kmer_size < 2 or $kmer_size > 10) {
        die "Kmer size must be in the range 2-10";
    }

    push @java_args,"-Dfastqc.kmer_size=$kmer_size";
}

use at your own risk.

ADD REPLY • link 7.6 years ago by Pierre Lindenbaum 161k

Minimum at 2 obviously makes sense. Any idea why they hard-coded the maximum at 10?

ADD REPLY • link 7.6 years ago by igor 13k

Because 10 is not 'too much' for memory: there are potentially 4^10 = 1,048,576 unique keys in the map. k=20 would be 4^20 = 1,099,511,627,776 ==> OUT OF MEMORY.
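
The arithmetic can be checked with a short sketch. It assumes a four-letter alphabet (A, C, G, T) and only counts how many distinct k-mers are possible; the function name `kmer_keyspace` is hypothetical and nothing here reflects FastQC's actual data structures:

```python
def kmer_keyspace(k: int, alphabet_size: int = 4) -> int:
    """Number of distinct k-mers possible over the given alphabet."""
    return alphabet_size ** k


# Compare the FastQC default, its upper limit, and the length asked about.
for k in (7, 10, 20):
    print(f"k={k}: {kmer_keyspace(k):,} possible k-mers")
```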

ADD REPLY • link 7.6 years ago by Pierre Lindenbaum 161k