Question

Biopython construct a sequence from multiple reads

1

Entering edit mode

7.7 years ago

sefu ▴ 20

Hi all, I started using biopython for DNA sequence Analysis a while ago. I am know facing a more complex problem and I wonder if there might be a straight forward solution in Biopython which I have missed so far. I have a attached a short description of my problem in the end of that post for everyone who is not interested in the long description ;)

Detailed description of my problem:

I would like to reconstruct a DNA sequence in biopython starting from multiple reads (most likely 3 or 4 reads which cover the whole DNA sequence of a protein). The reads might be combination of forward and backwards reads and therefore of them might contain information of the sense and others of the anti-sense strand. I know the sequencing primer and the position of such primers for all reads. So it would easily be possible to get the reads in the right order and transform all of them in the sense strand for example. My problem is that there are will be overlap between the different reads which not always will have the same length. So for example if my protein is 999 bases long, I wont get three reads which sum up to 999 bases (3x333). I will most likely get three reads suming up to 1400 bases. Is there a straight forward way in python to remove the redundant information in the overlap regions between different reads and convert the reads into one sequence? I also know the reference sequences (on DNA level) of all proteins I am looking for in my analytical data. My plan was to first construct a single sequence out my reads and than compare that sequence to all reference sequences in order to identify if the sequence (i.e. the reads) in my analytical data set corresponds to one of the reference proteins. Since all reference proteins share a common framework and only differ in several specified positions, it would be possible to align the reads onto the reference sequence(s) in order to construct a single sequence (maybe an alignment is not even necessary since I know the sequencing primers, their order and position). Unfortunately, I have no easy idea how to solve the problem of sequence overlaps between the reads.

In short:

Is there an "easy" or straight forward solution in biopython for constructing a single sequence from multiple reads with overlapping regions in biopython?

Thank you in advance!

Biopython DNA-Seq Multiple Reads • 4.2k views

ADD COMMENT • link updated 7.7 years ago by Daniel ★ 4.0k • written 7.7 years ago by sefu ▴ 20

0

Entering edit mode

Any reason you want to do this in python? Sounds like a job for a de novo assembler (Velvet?) to me. In addition, what would you do with inperfect matches/overlaps?

ADD REPLY • link 7.7 years ago by WouterDeCoster 47k

1

Entering edit mode

Thank you for your comment. I would like to do this kind of deconvolution (identifciation which clone is in which well) for a larger set of data (several 96-well plates). I already prepared the code for a related task with only one read and included lots of statistics into my tool. That worked perfectly with biopython (but I am new to sequence analysis, so it might be that there are better ways to solve such problems) and I thought I just could extend/modify my previous tool. I do not realy have an idea how to handle inperfect matches, especially if the this happens in the overlap regions. In cases with sequence differences in the overlap regions, I thought of constructing all possible complete sequences starting from the reads (so taking one or the other part of the overlapping region) and comparing all possible constructs against the library of reference sequences.

Can Velvet handle larger amounts of data and can be used in a batch/command line mode?

Thank you very much

EDIT: I just saw that Velvet was designed for short reads. I will have about 3 up to 4 reads from sanger sequencing with a length between 700-1400 bases. The overlap between the reads will be designed to be longer than 50 bases.

ADD REPLY • link 7.7 years ago by sefu ▴ 20

0

Entering edit mode

I haven't used Velvet myself (therefore the question mark in my previous comment). It's designed for genome assembly so it will very likely be able to handle your larger amounts of data.

ADD REPLY • link 7.7 years ago by WouterDeCoster 47k

0

Entering edit mode

You will probably be able to find an assembler corresponding to your data and needs here: https://en.wikipedia.org/wiki/Sequence_assembly#Available_assemblers

ADD REPLY • link 7.7 years ago by WouterDeCoster 47k

score 0 · Answer 1 · 2016-07-27

EDIT: Wait, I'm missing the obvious thing here, the Bio.AlignIO biopython package: http://biopython.org/wiki/AlignIO. You should just be able to align your sequences direct with that.

I highly recommend using the CAP3 program that is specifically designed for doing this, anticipates mismatches, and will also take the quality scores of your reads into account. This was part of my standard toolkit back in the glory days of sanger.

You can run it online here (basic), or with tweakable parameters here, but you can also download and run the program from here. If you wanted to I imagine you could run your biopython steps, then pass the data to cap3, then read it back into script for continuing the pipeline fairly straightforwardly.