Question

RNA-seq High Variation Between and Within Groups

0

Entering edit mode

5.7 years ago

jacobsen.jeremy ▴ 40

I have a simple two group experiment. Treatment Vs. Control, 4 biological replicates. The design is 150bp. Read depth is 100 million reads per sample, paired end. Each sample has a ERCC spike-in for post processing quality assessment. Pipeline is as follows->

Run FastQC on each fastq file.
Quality looks excellent, but Illumina adapters are present.
Paired-End trimmomatic run to remove adapter sequences.
FastQC is run again to verify adapter removal.
Lanes are merged using bash cat.
Tophat2 is run in paired end mode to align reads.

Command used:

subprocess.call(['tophat2','-p','64','-r','10','--mate-std-dev','51','--mate-inner-dist','-101','--b2-very-sensitive','--no-novel-juncs','--microexon-search', '--coverage-search','--library-type','fr-secondstrand','--output-dir',output_folder,'--GTF',gtf_file,bowtie2_builds,leftfileString, rightfileString])

Summary files look okay with ~70-75% concordant alignment.

HTseq is then used to count reads. So now I construct a read count table and the data are highly variable both within and between groups.

Green (Vehicle treated), Orange (Drug Treated), scale is log2!

Does anyone have an idea about what could cause this? Thanks!

For context, here are the box plots from a different experiment I ran a few months back:

RNA-Seq htseq Tophat2 • 1.5k views

ADD COMMENT • link updated 5.7 years ago by h.mon 35k • written 5.7 years ago by jacobsen.jeremy ▴ 40

0

Entering edit mode

See: How to add images to a Biostars post

You need the direct link to the image, not the link to the page that hosts the image.

ADD REPLY • link 5.7 years ago by Ram 43k

0

Entering edit mode

What do the box-plots show? You didn't tell us.

As you are finding the results strange, I would advise to not concatenate the lanes, but rather map and count them separately. Have all samples been sequenced to approximately the same depth? Besides lane, are there other batch effects?

On the other hand, this may be true biological variance, some experiments have higher variances and lower fold-changes than others.

ADD REPLY • link 5.7 years ago by h.mon 35k

0

Entering edit mode

Sorry. The box plots show the distribution of read counts per replicate. The first four are the vehicle treated replicates and the last four are drug treated. The desired read depth was 100M but what we received varied between 65M and 120M per replicate. There doesn't appear to be any correlation between read depth and variation.

ADD REPLY • link 5.7 years ago by jacobsen.jeremy ▴ 40

0

Entering edit mode

Just a side comment: are yo sure that your data is forward-stranded? option --library-type fr-secondstrand. Most popular RNAseq lib. prep. protocols today are reverse-stranded.

ADD REPLY • link 5.7 years ago by grant.hovhannisyan ★ 2.6k

0

Entering edit mode

I was originally using QoRTs for read counting and it complained that the strandedness might be different than specified so I used Chipster to verify that it is fr-secondstranded. The runs I had done in the past were firststranded, but I think the sequencing facility has upgraded their platform.

ADD REPLY • link 5.7 years ago by jacobsen.jeremy ▴ 40

0

Entering edit mode

What is the y-axis? Please provide some more details.

ADD REPLY • link 5.7 years ago by ATpoint 81k

0

Entering edit mode

Y axis is (RLE = log-ratio of read count to median read count across sample). These were generated using RUVseq which is software that uses the ERCC spike-ins to nomalize the data. The data in the figure is unnormalized.

ADD REPLY • link 5.7 years ago by jacobsen.jeremy ▴ 40