Question

STAR Splice Junctions normalization

0

5.3 years ago

Bastien Hervé 5.3k

I'm trying to investigate splicing events in RNA-seq experiments over multiple libraries.

STAR produces an SJ.out.tab output file listing the splice junctions detected in one library.

After alignment, I have one SJ.out.tab per library.

I would like to compare splicing event counts between libraries, but the libraries do not have the same number of mapped reads, so the raw counts cannot be compared directly.

Is there a way to normalize this kind of count, other than dividing by the library size?

Would running featureCounts and then DESeq2, to get size factors to apply to my splicing event counts, be a good option?
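
As a rough illustration of the size-factor idea, the median-of-ratios calculation that DESeq2 is known for can be written out by hand in base R. The sketch below is not DESeq2's own code. The count values and the names `gene_counts` and `junction_counts` are made-up placeholders, and it assumes the gene-level counts (e.g. from featureCounts) come from the same libraries as the junction counts.

```r
# Hypothetical gene-level counts (rows = genes, columns = libraries),
# e.g. as produced by featureCounts. Values are invented for illustration.
gene_counts <- matrix(
  c(100, 200,
    50,  110,
    400, 780,
    30,  60),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3", "g4"), c("lib1", "lib2"))
)

# Median-of-ratios size factors, written out by hand:
# 1) per-gene geometric mean across libraries (on the log scale),
# 2) per-library median of log(count) - log(geometric mean).
log_geo_means <- rowMeans(log(gene_counts))
usable <- is.finite(log_geo_means)  # drops genes with a zero count in any library
size_factors <- apply(gene_counts, 2, function(lib) {
  exp(median(log(lib[usable]) - log_geo_means[usable]))
})

# Hypothetical junction read counts for the same two libraries.
junction_counts <- matrix(
  c(150, 20,
    20,  45),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("GeneA_GeneB", "GeneC_GeneE"), c("lib1", "lib2"))
)

# Divide each library's column by its size factor.
normalized_junctions <- sweep(junction_counts, 2, size_factors, "/")
print(size_factors)
print(normalized_junctions)
```

Whether gene-level size factors are the right scaling for junction counts is exactly the open question here, so this only shows the mechanics.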

I'm aware of several biases involved in RNA-seq experiments:

Gene length: I want to compare gene1 against gene1 in different conditions, so it's OK
Gene GC composition: I want to compare gene1 against gene1 in different conditions, so it's OK
RNA population composition for each condition: biologically, I expect no variation in the amount of expressed genes, only more or fewer splicing events
Batch effect: the design is a bit messy, but I'll just ignore it for now (I'm in the exploration step)

What remains:

Library size: experiments took place in different ships, so the library sizes are very different

Thanks !

Edit: I'm not looking for gene isoforms. I'm more interested in gene recombination (spanning around 170 000 bp), as I'm working with B cells.

STAR RNA-Seq • 3.1k views

ADD COMMENT • link 5.3 years ago by Bastien Hervé 5.3k

0

I would add that you also need to normalize by gene expression. Otherwise you'll see an increase in all splicing events of an upregulated gene. A common approach is to normalize by the reads across an exon, for example by computing the PSI (percent spliced in).
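
To make the PSI idea concrete: for one exon in one library, PSI is usually taken as the reads supporting inclusion divided by the sum of reads supporting inclusion and exclusion. A minimal sketch with invented counts (the function name `compute_psi` and all numbers are assumptions, not part of any package):

```r
# PSI = inclusion reads / (inclusion reads + exclusion reads)
compute_psi <- function(inclusion_reads, exclusion_reads) {
  total <- inclusion_reads + exclusion_reads
  # Return NA where no read informs the event, instead of dividing by zero.
  ifelse(total > 0, inclusion_reads / total, NA_real_)
}

# Invented example: the gene is three times more expressed in cond2,
# so both read types triple, but the splicing ratio itself is unchanged.
inclusion <- c(cond1 = 30, cond2 = 90)
exclusion <- c(cond1 = 70, cond2 = 210)

psi <- compute_psi(inclusion, exclusion)
print(psi)
```

Both conditions give a PSI of 0.3 here, even though the raw junction counts differ threefold, which is the point of normalizing by expression.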

ADD REPLY • link 5.3 years ago by Martombo ★ 3.1k

0

My biologists claim that there should not be any variation in gene expression between conditions, only a change in splicing events. I'm not looking for gene isoforms; I'm interested in recombination of large genes (immune cells).

ADD REPLY • link 5.3 years ago by Bastien Hervé 5.3k

0

Do you want to investigate the actual splicing events or just the numbers of events per gene per condition?

If you aim for the latter, you could approach it with a Fisher's exact test per gene, followed by multiple hypothesis correction. The contingency table would be something like the number of events in the two conditions x {gene1, background}.

ADD REPLY • link 5.3 years ago by michael.ante ★ 3.8k

0

I don't want to investigate the actual splicing events. But I don't want just a global splicing count either.

I have a very small area of research, let's say 200 000 bp on a specific chromosome. Using my GTF file, I know all the genes in this region (maybe 20 genes). I want to be able to say "In condition1 I've got more gene recombination events joining gene A and gene B than in condition2".

I filtered my count table (the SJ.out.tab STAR output) to keep only my area of interest:

GeneA    GeneB    150 (reads covering the recombination)
GeneC    GeneE    20 (reads covering the recombination)
....

I'm looking for a normalization of these read counts.
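
For readers following along, the filtering step could look roughly like this in base R. The column names are my own labels for the nine SJ.out.tab columns, the file contents are invented, and the region coordinates are placeholders. Assigning gene names to junction ends from the GTF is not shown.

```r
# Column layout of STAR's SJ.out.tab (the file has no header line).
sj_cols <- c("chrom", "intron_start", "intron_end", "strand", "motif",
             "annotated", "uniq_reads", "multi_reads", "max_overhang")

# Stand-in for a real SJ.out.tab: three invented junctions in a temp file.
sj_file <- tempfile(fileext = ".tab")
writeLines(c(
  "chr12\t113000\t283000\t1\t1\t0\t150\t2\t45",
  "chr12\t120500\t121900\t1\t1\t1\t20\t0\t38",
  "chr3\t5000\t9000\t2\t1\t1\t75\t1\t50"
), sj_file)

sj <- read.table(sj_file, sep = "\t", col.names = sj_cols,
                 stringsAsFactors = FALSE)

# Hypothetical region of interest (chromosome and coordinates are placeholders).
roi <- list(chrom = "chr12", start = 100000, end = 300000)
in_roi <- sj$chrom == roi$chrom &
  sj$intron_start >= roi$start &
  sj$intron_end <= roi$end
sj_roi <- sj[in_roi, ]

# Total uniquely mapped junction reads in the region: one candidate
# denominator when comparing libraries.
roi_total <- sum(sj_roi$uniq_reads)
print(sj_roi)
print(roi_total)
```

Only the uniquely mapped read column is summed here; whether to include multi-mapping reads is a separate choice.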

ADD REPLY • link 5.3 years ago by Bastien Hervé 5.3k

1

Let's say you have a total of 1000 junction reads in condition A and 1500 in condition B. For the Gene1-to-Gene2 junction, you have 150 reads in condition A and 20 in condition B.

fisher.test(matrix(c(150,(1000-150),20,(1500-20)), ncol=2))
--> p< 2.2e-16

In case you have 200 instead of 20 reads in condition B:

fisher.test(matrix(c(150,(1000-150),200,(1500-200)), ncol=2))
--> p=0.24

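With this approach, no separate normalisation step is needed: the per-condition totals in the contingency table account for the difference in depth.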
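
Extending the two calls above to several junctions at once, with the multiple hypothesis correction mentioned earlier in the thread, might look like the sketch below. The junction names and counts are invented (the first row reuses the 150 vs 20 example), and using the region-wide junction total as background is an assumption.

```r
# Hypothetical junction read counts in the region of interest,
# one column per condition (values invented for illustration).
junctions <- data.frame(
  junction = c("GeneA_GeneB", "GeneC_GeneE", "GeneF_GeneG"),
  cond1 = c(150, 20, 830),
  cond2 = c(20, 45, 1435),
  stringsAsFactors = FALSE
)

# Background: all junction reads in the region, per condition.
total1 <- sum(junctions$cond1)  # 1000
total2 <- sum(junctions$cond2)  # 1500

# One 2x2 Fisher's exact test per junction: this junction vs. the rest.
p_values <- vapply(seq_len(nrow(junctions)), function(i) {
  tab <- matrix(
    c(junctions$cond1[i], total1 - junctions$cond1[i],
      junctions$cond2[i], total2 - junctions$cond2[i]),
    ncol = 2
  )
  fisher.test(tab)$p.value
}, numeric(1))

junctions$p_value <- p_values
# Benjamini-Hochberg correction across all tested junctions.
junctions$p_adjusted <- p.adjust(p_values, method = "BH")
print(junctions)
```

Note that this treats each condition as a single pooled library; it does not model replicates.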

ADD REPLY • link 5.3 years ago by michael.ante ★ 3.8k

0

Thanks, that's clearer! The number of Gene1-to-Gene2 junction reads could be very different between libraries, so I'll need to normalize one way or another.

Which kind of normalization should I apply to this type of data? Would a linear regression fit suit my data?

Also, what should I use as the scaling factor: the total number of reads, the number of mapped reads, or the number of splice junctions?

ADD REPLY • link 5.3 years ago by Bastien Hervé 5.3k

0

For the Fisher's exact test, you need to have counts. The normalisation is provided by the background data. In your case, that could be the total number of spliced reads in the 170 kbp area.

ADD REPLY • link 5.3 years ago by michael.ante ★ 3.8k

0

I have counts, but as I understand it, Fisher's exact test only tells me whether I need to normalize them. If I get a p-value < 2.2e-16, how do I normalize those counts?

ADD REPLY • link 5.3 years ago by Bastien Hervé 5.3k

0

This p-value indicates whether the difference in the number of reads spanning a certain junction is likely to be a "true" difference due to the condition or merely random.

Using this approach would give a p-value for each fusion/recombination event, like in a differential expression analysis. I thought this was what you needed.

ADD REPLY • link 5.3 years ago by michael.ante ★ 3.8k

0

Nah, let's say I have 2 conditions:

Condition 1) normal mouse
Condition 2) mouse modified to force splicing events between two genes -> geneA and geneB

The forced splicing event only occurs between these two genes; everywhere else, my modified mouse should show the same amount of splicing events.

So I expect that the huge number of splice events at this location (geneA/geneB) in the modified mouse is due to condition 2 (here I could use your Fisher's test to prove that).

But if I start with more reads in condition 2 than in condition 1, I will also, statistically, get more splicing events, which is not due to my condition. It's due to the total number of reads output by the sequencer.

I want to normalize this bias! I don't know if I'm clear enough.
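
To put numbers on that depth effect, here is a small sketch that scales junction reads to "per million uniquely mapped reads". All values are invented, and using the uniquely mapped read count (as reported in STAR's Log.final.out) as the denominator is just one of the candidate choices discussed in this thread.

```r
# Hypothetical reads spanning the geneA/geneB junction in each condition.
junction_reads <- c(condition1 = 20, condition2 = 300)

# Hypothetical uniquely mapped read totals per library.
uniquely_mapped <- c(condition1 = 20e6, condition2 = 40e6)

# Junction reads per million uniquely mapped reads.
junction_rpm <- junction_reads / uniquely_mapped * 1e6

# Compare the raw and depth-scaled fold changes.
raw_fold_change <- junction_reads[["condition2"]] / junction_reads[["condition1"]]
scaled_fold_change <- junction_rpm[["condition2"]] / junction_rpm[["condition1"]]
print(junction_rpm)
print(c(raw = raw_fold_change, scaled = scaled_fold_change))
```

With these made-up numbers, the raw 15-fold difference shrinks to 7.5-fold once the twofold difference in depth is accounted for.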

ADD REPLY • link 5.3 years ago by Bastien Hervé 5.3k

score 0 · Answer 1 · 2018-12-12

0

5.3 years ago

Kristoffer Vitting-Seerup ★ 4.0k

Why not jump to using one of the many tools made for this:

DEXSeq: identifies differential exon usage
SUPPA2 or (r)MATS, which are made for detecting changes in splicing events
My R package IsoformSwitchAnalyzeR can help you identify and analyse alternative splicing and isoform switches with predicted functional consequences (such as gain/loss of protein domains etc.), both in individual genes and at the genome-wide level.

ADD COMMENT • link 5.3 years ago by Kristoffer Vitting-Seerup ★ 4.0k

0

Actually, I'm not looking for differential exon usage but for gene recombination spanning around 170 000 bases. These tools are really exon-oriented; can they catch this kind of event?

ADD REPLY • link 5.3 years ago by Bastien Hervé 5.3k

1

Ahh - there are quite a few tools designed specifically for doing exactly that (many of which have been verified), which also means that the statistical analysis can handle, for example, replicates. Unfortunately it is a bit outside of my expertise, so I only know a few (e.g. this, this, this (which uses Kallisto) and this comparison) - but if you write a new question asking for such tools you might get some more.

ADD REPLY • link 5.3 years ago by Kristoffer Vitting-Seerup ★ 4.0k

1

For fusion-gene detection, you can also use STAR-Fusion. But since your target genes are quite closely located ("let's say 200 000 bp"), you should optimise the parameters for the detected fusion distance.

ADD REPLY • link 5.3 years ago by michael.ante ★ 3.8k

0

Thanks, I used STAR first to map my reads, so I was looking for something that works alongside STAR.

But this does not answer the title of this thread :) What is a good method to normalize this kind of count?

ADD REPLY • link 5.3 years ago by Bastien Hervé 5.3k

0

Thanks for those links, I'll read this comparison. STAR gave me an output of spliced regions, so I'll start with that, but I'm still looking for the normalization method that best fits my splicing event counts.

ADD REPLY • link 5.3 years ago by Bastien Hervé 5.3k