Question

Trying to wrap my head around Illumina paired end sequencing

1


3 months ago

ScottDansk ▴ 10

In Illumina paired end sequencing, I am confused how pairs end up with different sequences that overlap

I will give a basic example of paired end sequencing results from this DNA strand:

5' ATTTGCCCGC 3'
3' TAAACGGGCG 5'

This DNA strand is 10bp long and for this example I will be doing 7 cycles so 7bp reads.

For the forward strands, we would end up with the following DNA molecules bound to the flow cell (after bridge amplification, with the reverse strands cleaved and washed away), with the 3' end exposed and the 5' end bound to the flow cell:

F1: 5' GCGGGCAAAT 3'
F2: 5' ATTTGCCCGC 3'

which would result in the following 7bp reads for F1 and F2:

F1: 3' CCGTTTA 5'
F2: 3' ACGGGCG 5'

Then, for the reverse strands, we would end up with the following DNA molecules bound to the flow cell (after the second round of bridge amplification, with the forward strands cleaved and washed away):

R1: 5' ATTTGCCCGC 3' 
R2: 5' GCGGGCAAAT 3'

which would result in the following 7bp reads for R1 and R2:

R1: 3' ACGGGCG 5'
R2: 3' CCGTTTA 5'

Therefore for my pairs, F1R2 and F2R1 I would end up with the following reads:

F1: 3' CCGTTTA 5'
R2: 3' CCGTTTA 5'

F2: 3' ACGGGCG 5'
R1: 3' ACGGGCG 5'

They are exactly identical. I know that in sequencing you would not expect the reads of an F1R2 or F2R1 pair to be identical. Working through this example, I assumed they would overlap: with 7bp reads and a 10bp insert, they should overlap by 4bp, with 3bp on either side not matching.

Have I got something wrong in my theory?

Thanks!

paired-end illumina sequencing • 860 views

ADD COMMENT • link updated 3 months ago by Istvan Albert 100k • written 3 months ago by ScottDansk ▴ 10

1


I made this drawing some time ago (published in JOSS); it might be of help:

enter link description here

ADD REPLY • link 3 months ago by Juke34 8.5k

0


i am confused how pairs end up with different sequences that overlap

In real life one rarely wants sequences to overlap (special case libraries, short inserts etc). Having sequences overlap gives you a second read-out on the data but Illumina sequencing has become standard enough that one does not need to worry about technical replication any longer.
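A quick back-of-the-envelope sketch of when two reads overlap at all. The function name `read_overlap` is made up for illustration, "insert size" here means the fragment without adapters, and the 400/150 numbers are only illustrative:

```python
def read_overlap(insert_size, read1_len, read2_len):
    """Number of insert bases covered by both reads (0 if the reads never meet)."""
    # A read cannot cover more of the insert than the insert is long.
    covered1 = min(read1_len, insert_size)
    covered2 = min(read2_len, insert_size)
    # Reads start at opposite ends, so they overlap only if together
    # they cover more bases than the insert contains.
    return max(0, covered1 + covered2 - insert_size)


print(read_overlap(10, 7, 7))       # 4: the toy example from the question
print(read_overlap(400, 150, 150))  # 0: reads much shorter than the insert
```

So the 4bp overlap expected in the question only appears because the toy insert is shorter than the two reads combined.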

Graphic in this thread should prove useful: What is the difference between paired end reads and overlapping reads, and then why merge overlapping reads before assembly?

Edit: Sequencing always proceeds in a 5' --> 3' manner (on either strand) so that should be kept in mind.

ADD REPLY • link 3 months ago by GenoMax 141k

score 5 · Answer 1 · 2024-01-08

5


3 months ago

Istvan Albert 100k

I feel like you are describing the process in an overly complicated manner. It is hard to follow what you mean in each step and how you got the contradiction or even what that contradiction consists of.

Here is how I like to think about it in a simple way. First, the double strands are split into single strands. Now, suppose one of the single strands captured on the flow cell in the 5' -> 3' direction was

ATTTGCCCGC

In the first pass, the sequencer will generate read 1

--R1-->
ATTTGCCCGC

Then the sequence is reverse complemented, so that it again reads in the 5'->3' direction, and read 2 is generated:

--R2-->
GCGGGCAAAT

So the two reads will contain ATT... and GCG...
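The two steps above can be sketched in a few lines of Python. This is a toy model only: the helper names `reverse_complement` and `paired_end_reads` are hypothetical, adapters and chemistry are ignored, and the 7 cycles are taken from the question:

```python
from typing import Tuple


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence, written 5'->3'."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def paired_end_reads(fragment: str, cycles: int) -> Tuple[str, str]:
    """Simulate read 1 and read 2 for a fragment given 5'->3'."""
    read1 = fragment[:cycles]                      # first pass: start of the fragment
    read2 = reverse_complement(fragment)[:cycles]  # second pass: start of the reverse complement
    return read1, read2


fragment = "ATTTGCCCGC"
read1, read2 = paired_end_reads(fragment, cycles=7)
print(read1)  # ATTTGCC
print(read2)  # GCGGGCA

# Flipping read 2 back onto the fragment shows where it lands: the last 7 bases.
print(reverse_complement(read2))  # TGCCCGC
```

Read 1 covers positions 1-7 of the fragment and read 2, once flipped back, covers positions 4-10, so the two reads share the 4 bases TGCC in the middle but are not identical.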

ADD COMMENT • link 3 months ago by Istvan Albert 100k

0


Thank you for your response! I know my question was vague, sorry about that! It is a very complicated topic, so I am trying to break it down.

So on the flow cell you have this one strand of DNA (I will call this strand 1):

flow cell: 5' ATTTGCCCGC 3'

Read 1:

your first read (read1) would be:

    3' <--R1-- 5'
 5' ATTTGCCCGC 3'

as synthesis occurs 5' > 3' from the synthesising strand's perspective

so R1 = 3' ACGGGCG 5'
or 5' GCGGGCA 3'

Read 2:

then the complement would be created for read 2 and you would have:

flow cell: 5' GCGGGCAAAT 3'

then your second read would be:

   3' <--R2-- 5'
5' GCGGGCAAAT 3'

so R2 = 3' CCGTTTA 5'
or 5' ATTTGCC 3'

So that ends up the same as yours, with the two reads containing ATT (R2) and GCG (R1).

So in terms of paired reads, would these not be paired because they came from the same original strand, or would they be paired because of this?

In my understanding they are not paired, so they would be one read of an F1R2 pair and one read of an F2R1 pair?

i.e. in this example and my understanding, Read 1 would be F1 and Read 2 would be R1.

The other pair would come from the same original piece of DNA, but from the - strand (I will call this strand 2):

our starting strand on flow cell (from example before):

 5' ATTTGCCCGC 3'

complementary strand would then be

flow cell: 5' GCGGGCAAAT 3'


   3' <--R1-- 5'
5' GCGGGCAAAT 3'

then the complement:

   3' <--R2-- 5'
5' ATTTGCCCGC 3'

so R1 = 3' CCGTTTA 5'
and R2 = 3' ACGGGCG 5'

so from this complementary strand you would then get the pair, for this example it would be F2 and R2.

Then mapping would pair the reads and you would end up with F1R2 and F2R1.

My original (attempted) question is that when I work this out, F1 and R2 are identical, and F2 and R1 are also identical, but this is not how it looks in BAM files. Instead it looks like:

ref genome ===============
F1R2 pair: -------- (+ strand)
               ----------- (- strand)

But in my example, you have:

strand 1 read 1 = 3' ACGGGCG 5'
strand 1 read 2 =  3' CCGTTTA 5'

strand 2 read 1 =  3' CCGTTTA 5'
strand 2 read 2 = 3' ACGGGCG 5'

Therefore you end up with strand 1 read 1 identical to strand 2 read 2. In this example I thought strand 1 read 1 would pair with strand 2 read 2 (F1R2) ... but they would always be identical and never overlap like in the BAM file illustration above.

I know I must have something the wrong way round in my workings, but I am not sure where I am going wrong. Thanks for your help on this :)

ADD REPLY • link 3 months ago by ScottDansk ▴ 10

1


Again, your examples are too long and seemingly start off the wrong way, so it is challenging to follow each example when it seems to run off track from the start.

You should not use the words forward and reverse strand when referring to read 1 and read 2 of a paired-end sequencing run. (For what it's worth, this inconsistency is widespread. Countless documents call read 1 forward ... but then it leads to various contradictions similar to what you experience.)

You should consider using the words sense and antisense relative to the fragment. The fragment that gets sequenced may have come from either the forward or reverse strand of the genome, and the generated reads will progress in the sense and antisense direction of the fragment.

Suppose you have a fragment 5' ATGC ... ATGC 3'

Where the ... indicates that we don't quite know how long the fragment is.

This fragment could be from either the forward or reverse strand. We don't know that just from looking at the fragment.

The first read generated from this fragment starts with ATGC and goes on as long as the cycles; the typically desirable behavior would be not to reach the other end nor overlap with read 2.

We don't know whether read 1 represents the forward or the reverse strand. All we know is that it follows the same direction as the fragment.

In the second stage, the fragment gets reverse complemented. The second read corresponds to sequencing the reverse complement from the other side, and the read will start with GCAT. But again, we don't know if that matches the forward or reverse strand relative to the genome. All we know is that it is the opposite strand of read 1, that is, it runs in the antisense direction of the original fragment.

This is where the labeling mistake occurs: people call the strand opposite to read 1 the reverse strand. But they mean the reverse of read 1, not the original meaning of forward/reverse as defined relative to the genome.

Mixing up these two concepts, using words such as forward and reverse when sense and antisense are more appropriate, leads to a lot of confusion.
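To make this concrete, here is a toy sketch (hypothetical helper names, 7 cycles and the 10bp duplex taken from the original question) that sequences both single strands of the same double-stranded fragment:

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence, written 5'->3'."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def paired_end_reads(template, cycles):
    """Read 1 follows the template (sense); read 2 follows its reverse complement (antisense)."""
    return template[:cycles], reverse_complement(template)[:cycles]


top = "ATTTGCCCGC"                 # one strand of the duplex, 5'->3'
bottom = reverse_complement(top)   # the other strand, also 5'->3'

pair_from_top = paired_end_reads(top, 7)
pair_from_bottom = paired_end_reads(bottom, 7)

print(pair_from_top)     # ('ATTTGCC', 'GCGGGCA')
print(pair_from_bottom)  # ('GCGGGCA', 'ATTTGCC')
```

Both templates yield the same two sequences; only the read 1 / read 2 roles are swapped. Nothing in the reads themselves says which template came from the forward strand of the genome.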

ADD REPLY • link 3 months ago by Istvan Albert 100k

0


Hello, thank you again for replying. I feel I have confused myself countless times, but this helped a lot, thank you!

I feel that I have worked this through in my head now:

In your example this fragment could be from either forward or reverse strand of the genome, we don't know at this point.

Read 1 sequences it, and then it is reverse complemented, and sequenced from the other end.

So you end up with 2 reads, one which maps to the forward strand of the genome and one the reverse (the sense and antisense sequences of the initial fragment)

During alignment, BWA MEM will align to the reference genome and then can work out which strand of the reference each read aligns to:

i.e. if read 1 aligns to the forward strand of the genome, it will be F1, and therefore read 2 will align to the reverse strand, meaning it will be R2. Therefore they will be an F1R2 pair.

Hopefully this is correct :D thanks again

ADD REPLY • link 3 months ago by ScottDansk ▴ 10

1


Yes, this looks correct.

Once we align the reads, we can tell whether the forward or reverse fragment was sequenced from the orientation of the read pair, knowing that read 1 was sequenced first.
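As a minimal sketch of that last step, the snippet below labels a pair by exact-match lookup on a made-up toy reference. The function `pair_orientation` and the reference string are assumptions for illustration and are no substitute for a real aligner, which records the same information in the SAM FLAG field (0x10 for reverse strand, 0x40 and 0x80 for first and second in pair):

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence, written 5'->3'."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def pair_orientation(reference, read1, read2):
    """Label a read pair relative to the forward strand of a toy reference."""
    # Read 1 matches the forward strand as-is; read 2 matches only after flipping.
    if read1 in reference and reverse_complement(read2) in reference:
        return "F1R2"
    # Read 2 matches the forward strand as-is; read 1 matches only after flipping.
    if reverse_complement(read1) in reference and read2 in reference:
        return "F2R1"
    return "unplaced"


reference = "CCATTTGCCCGCAA"  # hypothetical forward strand containing the fragment

print(pair_orientation(reference, "ATTTGCC", "GCGGGCA"))  # F1R2
print(pair_orientation(reference, "GCGGGCA", "ATTTGCC"))  # F2R1
```

The same two sequences give F1R2 or F2R1 depending only on which one was sequenced first.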

ADD REPLY • link 3 months ago by Istvan Albert 100k