Question

detailed explanation of Insert Size

6.5 years ago

SMILE ▴ 170

Hi all,

I have read through many posts about insert size here and found a very good answer on the topic.

However, insert size is still not entirely clear to me, and I hope some experts can clarify it.

As illustrated in a good blog post and a good answer, the "insert size" is the sequence between the adapters (it encompasses R1 and R2 as well as the unknown gap between them). It is also said that the ninth column of the SAM file (TLEN) represents the insert size.

However, here are some things I still don't understand.

First, in RNA-seq data, if the alignments are spliced, TLEN reports the distance from the 5'-most to the 3'-most mapped position (if my understanding is right). So, as I understand it, TLEN will include any introns, which means TLEN would usually be longer than the "actual insert size"?

Second, if we are mapping DNA sequences, are the fragment length and the "insert size"/"template length" the same?

Third, how does Picard's CollectInsertSizeMetrics actually calculate the insert size distribution of a paired-end library? Does it use only TLEN, or does it exclude possible introns?

Any answer that helps me better understand this concept will be greatly appreciated.

RNA-Seq sequencing alignment • 6.0k views

updated 6.5 years ago by Devon Ryan 104k • written 6.5 years ago by SMILE ▴ 170

This is the best illustration for this: A: What is the different between Read and Fragment in RNA-seq?

6.5 years ago by GenoMax 141k

Yes, that is already part of the background of my question...

My questions are:

First, in RNA-seq data, if the alignments are spliced, TLEN reports the distance from the 5'-most to the 3'-most mapped position (if my understanding is right). So, as I understand it, TLEN will include any introns, which means TLEN would usually be longer than the "actual insert size"?

Second, if we are mapping DNA sequences, are the fragment length and the "insert size"/"template length" the same?

Third, how does Picard's CollectInsertSizeMetrics actually calculate the insert size distribution of a paired-end library? Does it use only TLEN, or does it exclude possible introns?

6.5 years ago by SMILE ▴ 170

Second, if we are mapping DNA sequences, are the fragment length and the "insert size"/"template length" the same?

Technically, fragment length will never equal insert size (if you only consider size in bp), since the fragment includes the insert plus the Illumina adapters. If the DNA fragment does not contain a breakpoint/translocation, then it represents a contiguous stretch of DNA in the genome.

I will let someone else tackle #1 and 3.

6.5 years ago by GenoMax 141k

If you are interested in insert size calculation then use these directions (for BBMap tools).

6.5 years ago by GenoMax 141k

Thank you for your advice, I will give it a try.

6.5 years ago by SMILE ▴ 170

score 4 · Accepted Answer · 2017-10-10

So, as I understand it, TLEN will include any introns, which means TLEN would usually be longer than the "actual insert size"?

Yes, the TLEN field won't always be terribly useful in RNA-seq. When trying to compute the original fragment sizes, it's best not to have spliced fragments. Back when we used tophat2 in our production pipeline, our "insert size estimation" step aligned to the transcriptome to avoid this problem.
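To make the intron effect concrete, here is a minimal Python sketch (not taken from any aligner or from Picard) that compares the genomic span of a read pair with the same span after removing bases skipped by `N` CIGAR operations. The function names and the example coordinates are made up for illustration. Note that even the "spliced-out" number can still overestimate the true insert, because an intron lying entirely in the unsequenced gap between the mates is invisible in the alignments.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def walk_cigar(pos, cigar):
    """Walk a CIGAR string starting at 1-based POS.

    Returns (end, introns): `end` is the exclusive reference end of the
    alignment and `introns` lists (start, end) intervals skipped by N ops.
    """
    ref = pos
    introns = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length))
        if op in "MDN=X":  # operations that consume the reference
            ref += length
    return ref, introns


def merged_length(intervals):
    """Total length of the union of half-open intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def span_vs_spliced(pos1, cigar1, pos2, cigar2):
    """Return (genomic span of the pair, span minus observed introns)."""
    end1, introns1 = walk_cigar(pos1, cigar1)
    end2, introns2 = walk_cigar(pos2, cigar2)
    span = max(end1, end2) - min(pos1, pos2)
    # Merge introns so one shared by both mates is subtracted only once.
    return span, span - merged_length(introns1 + introns2)


# Hypothetical pair: mate 1 crosses a 2000 bp intron, mate 2 is unspliced.
print(span_vs_spliced(1000, "50M2000N50M", 3150, "100M"))
```

In this made-up example the genomic span is 2250 bp, while the transcript-level length implied by the alignments is only 250 bp, which is the kind of gap that aligning to the transcriptome avoids.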

Second, if we are mapping DNA sequences, are the fragment length and the "insert size"/"template length" the same?

This will depend a bit on which fragment you're talking about. See the comment from genomax.

How does Picard's CollectInsertSizeMetrics actually calculate the insert size distribution of a paired-end library? Does it use only TLEN, or does it exclude possible introns?

I've never run that tool on RNA-seq data, so I'm not sure how useful it would be. I would expect that it just summarizes the TLEN field, so I'd expect some absurdly high mean values.
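As a rough illustration of why a plain TLEN summary can give an inflated mean, here is a small Python sketch that reads SAM-formatted lines and summarizes column 9. This is only an assumption about what "summarizing the TLEN field" might look like, not Picard's actual implementation; the function name `tlen_summary` and the three example pairs are invented.

```python
import statistics


def tlen_summary(sam_lines):
    """Return (mean, median) of TLEN over properly paired reads.

    Only positive TLEN values are kept so that each pair is counted once.
    Raises statistics.StatisticsError if no record qualifies.
    """
    tlens = []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        tlen = int(fields[8])  # ninth SAM column
        if flag & 0x2 and tlen > 0:  # 0x2 = mapped in a proper pair
            tlens.append(tlen)
    return statistics.mean(tlens), statistics.median(tlens)


def record(name, flag, pos, cigar, mate_pos, tlen):
    """Build a minimal, made-up SAM record with the 11 mandatory fields."""
    return "\t".join(
        [name, str(flag), "chr1", str(pos), "60", cigar,
         "=", str(mate_pos), str(tlen), "*", "*"]
    )


# Hypothetical data: pair r2 has one mate spanning a 2000 bp intron.
example = [
    "@HD\tVN:1.6",
    record("r1", 99, 1000, "100M", 1150, 250),
    record("r1", 147, 1150, "100M", 1000, -250),
    record("r2", 99, 5000, "50M2000N50M", 7150, 2250),
    record("r2", 147, 7150, "100M", 5000, -2250),
    record("r3", 99, 9000, "100M", 9100, 200),
    record("r3", 147, 9100, "100M", 9000, -200),
]

mean_tlen, median_tlen = tlen_summary(example)
print(mean_tlen, median_tlen)
```

With these invented records a single spliced pair pulls the mean up to 900 while the median stays at 250, so for RNA-seq the median (or a histogram) is likely the more informative number to look at.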