Assemble short reads based on k-mers
8.8 years ago
venu 7.1k

Hello all,

I am completely new to this kind of task. I have data like this:

>in0
GATCCTCGAAGTTACACGGG
>in1
TACGTCGACGTCAATCCGGG
>in2
TACACGGGCCGCTCCTGGGC
>in3
ACGGGGTACTACGAGACGCG
>in4
AGGGGGAATGTGGTCCACAT
>in5
TCCACATGGCTTGCTCCTGA
>in6
CTTGACGTTATGAATTTCGC

and so on. I need to assemble these short reads, and I want to use Perl for this. I just need pseudocode showing how to do it, or a pointer to a good resource. At the end I need a single string containing the consensus sequence.

perl k-mer assembly • 3.0k views

Is there a reason you want to reinvent the assembly wheel? A good number of assemblers have already been written, so why write yet another one without a good reason?

Can you give me some examples, so that I can look them up directly on the internet?

As orange said, SOAPdenovo is one option. Others include Trinity and Minia. You will find quite a few more if you search PubMed for "DNA assembler" or "DNA assemble".

8.8 years ago
orange ▴ 30

Why not try SOAPdenovo instead of Perl? Have you not assembled a genome sequence before?

8.8 years ago
thackl ★ 3.0k

Assuming you want to stick to Perl for educational purposes:

Here is some code that creates k-mers quite efficiently in Perl: https://github.com/thackl/perl5lib-Kmer/blob/master/lib/Kmer.pm

Perl is not really made for handling graph structures, but there is one module that you could use to set up a De Bruijn graph: http://search.cpan.org/~jhi/Graph-0.96/lib/Graph.pod. I played around with it some time ago but did not follow through.
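
To make the idea concrete, here is a minimal sketch of both steps (cutting reads into k-mers and linking them into a De Bruijn graph). It is written in Python rather than Perl, it is not taken from the Kmer.pm or Graph modules linked above, and the function names `kmers` and `de_bruijn` as well as the choice k=8 are just assumptions for illustration:

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all overlapping substrings of length k, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for read in reads:
        for kmer in kmers(read, k):
            # One edge per k-mer occurrence: prefix node -> suffix node.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


# in0 and in2 from the question; they share the 8-mer TACACGGG.
reads = ["GATCCTCGAAGTTACACGGG", "TACACGGGCCGCTCCTGGGC"]
graph = de_bruijn(reads, k=8)
print(graph["TACACGG"])  # the shared edge shows up once per read
```

Edges that appear more than once are k-mers seen in several reads, which is how overlaps between reads show up in the graph. Turning the graph into a sequence still requires walking a path through it, which this sketch does not do.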

I just need a single consensus string from the file shown above.

But definitely in Perl?

Not necessarily. Any simple program that takes the above file as input and outputs the consensus string would do. I can read Perl code easily, hence the tag.

But your data sets are small. Do you want to do some form of micro-assembly?

Yes. I have 200 such reads in a file and I want output like:

 GGCATTTAACCGAAGCCGGTGGGTTAGACTATGATCCTCGAAGTTACACGGGCCGCTCCTGGGCGTGGCTGCTCCCAGCCCTAGCCCCAATGTAATATAAAGGTCGTGCCCAGTTAGCGTTAAGCAAGAGGTGTTACAAATATCTTGGAGAGTCATGTCGCAATTCTTGACGTTATGAATTTCGCGGTGAACAATGTCGCCCAGAATGGCAGGTCATGAAAAGCTTCAGCGGGAACCAGCAC....
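
For a handful of error-free reads like these, one simple approach is greedy overlap merging: repeatedly join the two sequences whose suffix/prefix overlap is longest. The sketch below is only an illustration in Python, not what a real assembler does. The names `overlap` and `greedy_assemble` and the minimum overlap of 5 bases are assumptions, and it ignores sequencing errors, reverse complements and repeats:

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=5):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(seqs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more, so several contigs remain
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (best_i, best_j)]
        seqs.append(merged)
    return seqs


# in0 and in2 from the question overlap by 8 bases (TACACGGG).
print(greedy_assemble(["GATCCTCGAAGTTACACGGG", "TACACGGGCCGCTCCTGGGC"]))
# ['GATCCTCGAAGTTACACGGGCCGCTCCTGGGC']
```

The merged string is a substring of the expected output above. When reads do not overlap by at least `min_len` bases, the function returns more than one contig instead of a single string.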

What are the coverage and the length of the reads?

I am doing this kind of work for the first time. All I know is that each read has k-mers of different lengths.

Take a look at this: http://www.homolog.us/Tutorials/index.php?p=1.1&s=2

This should give you a basic overview of assembly.

8.8 years ago
thackl ★ 3.0k

You can use SPAdes:

spades.py --only-assembler --sc -k 33 -s in.fq -o asm
  # results are in asm/contigs.fasta

This should also work for lowish coverage (5-10X), but it assumes little to no error in your data. Also, there can be multiple contigs, and regions with very low coverage (just 1-3 reads at a certain position, e.g. the ends of your target sequence) will be missing. Note that k has to be odd and shorter than your reads, so lower -k if your reads are shorter than 33 bp.
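
If you are unsure what your coverage is, a rough estimate is the total number of sequenced bases divided by the length of the target sequence. A small Python sketch, where the function name `mean_coverage` and the target length of 500 bp are made-up assumptions (only the 200 reads of 20 bp come from this thread):

```python
def mean_coverage(read_lengths, target_length):
    """Average number of reads covering each position of the target."""
    return sum(read_lengths) / target_length


# 200 reads of 20 bp each, against a hypothetical 500 bp target.
print(mean_coverage([20] * 200, 500))  # 8.0
```

This is only an average. Individual positions, especially near the ends of the target, can be covered by far fewer reads.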
