join non-overlapping paired-end reads
1
1
Entering edit mode
9.9 years ago
Daniel ★ 4.0k

I have a batch of illumina paired end reads with a shorter than anticipated reverse sequence for a ~550bp amplicon.

The joining tools which I have used before (i.e. USEARCH, ea-utils) talk about merging the overlaps and discarding other sequences but is there a way to join forwards and reverses leaving a gapped interval?

Expected output:

>seq1
ATGCAGTCGATCGATCGACTGCATGCATCGATCGAATCGATCG--------------TGCAGTCGATCGTACG
>seq2
CAGTCTAGCTACGATCGATCGACTGCATACGTACGTACG-------------------CTGCATGCATCGTAG
>seq3
CTAGACTGATCGCTAGCCTAGCTACGATCGATCGACACTGCTGTCGACTGACTGATCGATCGATCGACTGCAT

Thanks

paired-end illumina • 8.1k views
ADD COMMENT
1
Entering edit mode

You might consider checking out FLASH. It is really fast and has many options.

ADD REPLY
0
Entering edit mode

Thanks, this looks good.

ADD REPLY
0
Entering edit mode

Be aware that FLASH actually has a fairly high false positive rate.. It is still pretty useful though! However, i thought FLASH also can only concat mates which actually overlap?!?!

ADD REPLY
1
Entering edit mode

You're correct regarding requiring overlap. Nothing could accurately work without it unless the reads have already been mapped. FYI, it doesn't concatenate sequences, but merges them. Actually concatenating them wouldn't be very useful.

ADD REPLY
0
Entering edit mode

Is there any tool to merge (i.e concatenate R1 and R2)the aligned reads ? This is particularly helpful in population genomics if you do not want linked SNPs i.e people need one snp per loci. If Read 1 has a SNP, they do not want to consider the SNP from Read 2.

ADD REPLY
0
Entering edit mode

It would be simpler to just use a variant caller that won't do weird things in those cases. For example, samtools mpileup would set the phred score of the lower of those base calls to 0, which could then be trivially filtered.

ADD REPLY
0
Entering edit mode

To directly answer you question, probably though I don't know of one off-hand. In any case, that'd be simple to write.

ADD REPLY
0
Entering edit mode

I am trying in Pysam. It is giving some wired error discussed here Mate Pairs In Pysam

ADD REPLY
0
Entering edit mode

Presumably the mate really isn't in the BAM file. Try using grep with samtools to see.

ADD REPLY
0
Entering edit mode

Its working. For some reasons, my read names have _1 and _2 which makes them different according to parser.

ADD REPLY
0
Entering edit mode

That's not the best parser. Mates don't need to have the same name, so someone didn't think that out properly.

ADD REPLY
0
Entering edit mode

Am I correct in assuming that you've already aligned the reads?

ADD REPLY
0
Entering edit mode

At this point, no. Are you saying I should align the forwards and reverses against a reference first before attempting to join?

ADD REPLY
0
Entering edit mode

If they don't have significant overlap, then yes (it'd be impossible to accurately judge the distance between them otherwise).

ADD REPLY
0
Entering edit mode

Better yet, explain what purpose is this endeavour supposed to satisfy?

ADD REPLY
0
Entering edit mode

Were you able to determine a good method for doing this? I am also interesting in merging non-overlapping reads which are from a friend's sequencing run in which they used primers targeting a 643 bp fragment which unfortunately is slightly too large to produce an overlap during a MiSeq run.

ADD REPLY
0
Entering edit mode
9.7 years ago

I think the direct answer to your question is that, no, you cannot merge paired-end reads that do not overlap. However, you could attempt to merge all pairs (using tools already noted) and then leave the non-merged pairs as pairs. Then, align the merged reads and the paired-ends independently and combine the output. I am not sure how merging will impact your results relative to not merging.

ADD COMMENT

Login before adding your answer.

Traffic: 2969 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6