Question

What does a sequence read look like if the cycle count is larger than the strand length?

1

7.4 years ago

L. A. Liggett ▴ 120

I realized that I don't know what happens when Illumina sequencing chemistry reaches the end of a fragment. Does the reaction stop for that fragment, or are bases still added in some way? I ask because my fragments range from 120 to 160 bp in length, yet with 150 cycles I always get 150 bp reads. Some of my reads also end in terminal repeats like this:

CGTCTTCTGCTTGAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

next-gen sequencing • 4.2k views

ADD COMMENT • link updated 7.4 years ago by Brian Bushnell 20k • written 7.4 years ago by L. A. Liggett ▴ 120

0

Does your fragment size include the sequencing adapters?

ADD REPLY • link 7.4 years ago by John 13k

0

No, the adapters are already trimmed off.

ADD REPLY • link 7.4 years ago by L. A. Liggett ▴ 120

0

Wait, er, what? Then you should sequence into the adapters, no? This doesn't come after the adapters, does it?

ADD REPLY • link 7.4 years ago by John 13k

0

Yeah, that was unclear. I was trying to say that the adapters had already been trimmed from my data, but I guess maybe this is not the case.

ADD REPLY • link 7.4 years ago by L. A. Liggett ▴ 120

0

Is this RNA-seq on NextSeq/MiniSeq?

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

This is DNA sequencing on the HiSeq 4000.

ADD REPLY • link 7.4 years ago by L. A. Liggett ▴ 120

0

Hm, my guess was based on the polyA and polyG :p

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Yes, normally on a HiSeq 2500, it seems like there's mainly poly-A, while on NextSeq there's poly-A for a little while then poly-G. I also thought it was probably NextSeq. This might be because NextSeq and HiSeq 3000+ both need the same base-calling software; our 2500s are using an older version than we use for our NextSeq.

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

1

From my understanding the polyG on NextSeq/MiniSeq is due to the two-colour chemistry of those sequencers compared to four-colour on the other machines. On NextSeq/MiniSeq, absence of signal indicates a G. (see also this post on qcfail)
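
To make the two-colour logic concrete, here is a minimal Python sketch. The thread only states that absence of signal is read as G and that A comes from imaging both dyes; which single channel stands for C and which for T, as well as the function name, are assumptions made for illustration.

```python
def call_base_two_channel(red: bool, green: bool) -> str:
    """Map signal presence in two imaging channels to a base call.

    Assumed mapping: both channels = A, no signal = G.
    The red-only = C / green-only = T assignment is an assumption.
    """
    if red and green:
        return "A"
    if red:
        return "C"
    if green:
        return "T"
    return "G"  # a dark cluster is still called as a base


# A cluster that has run out of template emits no signal in either channel.
dark_cycles = [(False, False)] * 10
dark_calls = "".join(call_base_two_channel(r, g) for r, g in dark_cycles)
print(dark_calls)
```

Under this scheme a cluster with nothing left to sequence cannot produce "no call"; every dark cycle looks like a G, which would give a poly-G tail.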

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

That's true, but it does not explain the poly-A prior to the poly-G. In fact, the poly-A on NextSeq tends to be the same length for every read, so it actually appears in the consensus of BBMerge's "outa" results:

adapter sequence - poly-A - poly-G

I don't know whether the poly-A is actual signal, or a base-caller artifact.

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

0

Since A is the result of imaging both dyes, it might just be decaying noise... who knows!

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Once we go off the end of an adapter-fragment-adapter construct, the sequencer may start sequencing into the adapter lawn present on the flowcell. This can lead to very odd results (poly-A runs can be one manifestation).

ADD REPLY • link 7.4 years ago by GenoMax 141k

1

It's highly unlikely that the signal represents sequencing into the adapter lawn. You'd have to invoke some bizarre mechanism of strand dissociation/mismatch annealing/synthesis that hasn't been reported previously for DNA polymerases (whose biochemical properties have been studied intensively for about five decades).

More likely is addition of an untemplated 3' A (a known activity of Taq and similar DNA polymerases) to a subset of molecules for a few cycles until that peters out, then G calls afterward due to background/non-signal. Or it could be a function/artifact of the base-caller (much like the '2' PHRED score conventionally signifies a run of low Qs at the end of the read).

Disclaimer: rampant speculation on my part!

ADD REPLY • link 7.4 years ago by harold.smith.tarheel ★ 4.9k

0

I recall that being offered as a possible explanation by someone in Illumina tech support a long time ago. But I don't have any hard evidence of that conversation/a document I can point to. AFAIK the sequence of the adapters on the flowcell is a trade secret.

ADD REPLY • link 7.4 years ago by GenoMax 141k

0

The nucleotide sequence on the flow cell has to be complementary to the adapter for the library to anneal. But the moiety used to tether it and any chemical modifications are indeed proprietary.

ADD REPLY • link 7.4 years ago by harold.smith.tarheel ★ 4.9k

score 6 · Accepted Answer · 2016-12-06

6

7.4 years ago

Brian Bushnell 20k

"CGTCTTCTGCTTG" matches the end of an Illumina adapter sequence. After you run off the end of an adapter, it's common to get poly-A, and sometimes eventually poly-G. So, no, your adapters have NOT been trimmed - you are sequencing into them, and then off the end into no signal.
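
The adapter tail / poly-A / poly-G layout can be made visible by peeling homopolymer runs off the 3' end of the read quoted in the question. This is only a sketch: the helper names and the `min_run` threshold of 5 are assumptions, not part of any tool mentioned in the thread.

```python
from itertools import groupby


def run_lengths(seq):
    """Return the read as (base, run_length) pairs in 5'->3' order."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


def trailing_homopolymers(seq, min_run=5):
    """Collect consecutive runs of at least min_run identical bases at the 3' end."""
    tail = []
    for base, length in reversed(run_lengths(seq)):
        if length < min_run:
            break
        tail.append((base, length))
    return list(reversed(tail))


# Read fragment quoted in the question.
read = "CGTCTTCTGCTTGAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"
tail = trailing_homopolymers(read)
tail_length = sum(length for _, length in tail)
remainder = read[:len(read) - tail_length]

print([base for base, _ in tail])  # order of the homopolymer runs at the 3' end
print(remainder)                   # what is left once those runs are removed
```

What remains after removing the runs is the adapter end itself, so none of this fragment is insert sequence.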

You can (and should) trim your adapters, as described in this post.
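
For intuition only, adapter trimming boils down to something like the Python sketch below. It uses exact matching, whereas dedicated trimmers also tolerate mismatches and use read-pair overlap. The adapter prefix `AGATCGGAAGAGC` is an assumption (check the sequences for your own library prep kit), and the insert sequence is made up for the example.

```python
def trim_adapter(read, adapter, min_overlap=5):
    """Cut the read at the first exact adapter match, or at a partial match
    (a prefix of the adapter) that runs to the 3' end of the read."""
    hit = read.find(adapter)
    if hit != -1:
        return read[:hit]
    # Partial adapter at the 3' end: try the longest overlap first.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read


adapter = "AGATCGGAAGAGC"      # assumed adapter prefix, verify for your kit
insert = "ACGTTGCAATCGGCTA"    # hypothetical insert sequence

# Short insert: the read runs through the adapter and off the end.
full_case = trim_adapter(insert + adapter + "AAAAAGGGGG", adapter)
# Longer insert: only the first few adapter bases fit into the read.
partial_case = trim_adapter(insert + adapter[:8], adapter)

print(full_case == insert, partial_case == insert)
```

Everything after the adapter start is discarded too, which is what removes the poly-A and poly-G tails.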

Also, to clarify, the Illumina sequencing machine has no idea how long your fragments are. You tell it to run for 150 cycles, so it gives you 150 cycles of data for every read, regardless of the actual length of the molecule.

ADD COMMENT • link 7.4 years ago by Brian Bushnell 20k

0

Thanks, that makes sense. But how does the fragment continue to polymerize with no template for incoming bases to bind?

ADD REPLY • link 7.4 years ago by L. A. Liggett ▴ 120

1

It doesn't. But the instrument still attempts to call bases from background fluorescence. Typically, the Q scores from off-the-end sequences are really low.
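
Those low Q scores can be used directly. A rough Python sketch of 3'-end quality trimming follows, assuming Phred+33 encoded quality strings; the threshold `min_q=10` and the example record are made up for illustration.

```python
def phred33(quality):
    """Decode a Phred+33 quality string into integer Q scores."""
    return [ord(ch) - 33 for ch in quality]


def trim_low_quality_tail(seq, quality, min_q=10):
    """Drop bases from the 3' end while their Q score is below min_q."""
    scores = phred33(quality)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


# Hypothetical record: 'I' decodes to Q40, '#' decodes to Q2.
seq = "ACGTACGTAAAGGG"
qual = "IIIIIIII######"
trimmed_seq, trimmed_qual = trim_low_quality_tail(seq, qual)
print(trimmed_seq, trimmed_qual)
```

Quality trimming alone will not remove adapter bases that were sequenced with good quality, so it complements adapter trimming rather than replacing it.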

ADD REPLY • link 7.4 years ago by harold.smith.tarheel ★ 4.9k

0

The fragments have had adapter sequences ligated at both the 5' and 3' ends. When the insert is shorter than the number of sequencing cycles, you read into the adapter sequence, which is why you get adapter sequences at the 3' ends of some reads.

ADD REPLY • link 7.4 years ago by mastal511 ★ 2.1k

0

Sorry to hijack the question/answer, but I don't feel this warrants a new question:

FASTQ files coming off all modern NGS machines typically have N sequenced entries, where every entry contains the same number of sequenced bases. If the first entry is 150 bp long, chances are they are all 150 bp long. Of course the standard permits mixing of sequence lengths, but I'm curious whether any machine has ever output a FASTQ file with entries of different lengths. To my knowledge, even the old Sanger sequencers I used to work with were programmed to do X cycles, and even if they sequenced multiple samples in parallel, all samples got the same number of cycles. Maybe I'm wrong?

ADD REPLY • link 7.4 years ago by John 13k

2

PacBio and Oxford Nanopore (not natively FASTQ) will have different read lengths. But for Illumina I think it's safe to expect that untrimmed reads all have the same length...

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

2

If your sequencer is set up to use the on-board software (e.g. MiSeq Reporter) to post-process data, it may look as if you have sequencing reads of different lengths (if they get trimmed). Otherwise every read that passes Illumina's chastity filter will be of identical length.
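
A quick way to check a given file is to tally the read lengths. The Python sketch below assumes a plain four-line-per-record FASTQ (no wrapped sequence lines); the function name and the in-memory example records are made up for illustration.

```python
import io
from collections import Counter


def read_length_counts(handle):
    """Count read lengths in a four-line-per-record FASTQ stream."""
    counts = Counter()
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:  # the sequence line of each record
            counts[len(line.rstrip())] += 1
    return counts


# Two hypothetical records of different length, as trimming would produce.
fastq = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n")
counts = read_length_counts(fastq)
print(dict(counts))
print("uniform length" if len(counts) == 1 else "mixed lengths")
```

For a real file, pass an open file handle instead of the `io.StringIO` object. A single length in the result suggests untrimmed reads, while several lengths point to some trimming step upstream.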

ADD REPLY • link 7.4 years ago by GenoMax 141k

0

Thank you both - that's very interesting to get some real-world examples of FASTQs that could have variable sequence lengths :)