RNA-seq High Variation Between and Within Groups
0
0
Entering edit mode
5.7 years ago

I have a simple two group experiment. Treatment Vs. Control, 4 biological replicates. The design is 150bp. Read depth is 100 million reads per sample, paired end. Each sample has a ERCC spike-in for post processing quality assessment. Pipeline is as follows->

  1. Run FastQC on each fastq file.
  2. Quality looks excellent, but Illumina adapters are present.
  3. Paired-End trimmomatic run to remove adapter sequences.
  4. FastQC is run again to verify adapter removal.
  5. Lanes are merged using bash cat.
  6. Tophat2 is run in paired end mode to align reads.

Command used:

subprocess.call(['tophat2','-p','64','-r','10','--mate-std-dev','51','--mate-inner-dist','-101','--b2-very-sensitive','--no-novel-juncs','--microexon-search', '--coverage-search','--library-type','fr-secondstrand','--output-dir',output_folder,'--GTF',gtf_file,bowtie2_builds,leftfileString, rightfileString])

Summary files look okay with ~70-75% concordant alignment.

HTseq is then used to count reads. So now I construct a read count table and the data are highly variable both within and between groups.

Green (Vehicle treated), Orange (Drug Treated), scale is log2!

Does anyone have an idea about what could cause this? Thanks!

For context, here are the box plots from a different experiment I ran a few months back:

RNA-Seq htseq Tophat2 • 1.5k views
ADD COMMENT
0
Entering edit mode

See: How to add images to a Biostars post

You need the direct link to the image, not the link to the page that hosts the image.

ADD REPLY
0
Entering edit mode

What do the box-plots show? You didn't tell us.

As you are finding the results strange, I would advise to not concatenate the lanes, but rather map and count them separately. Have all samples been sequenced to approximately the same depth? Besides lane, are there other batch effects?

On the other hand, this may be true biological variance, some experiments have higher variances and lower fold-changes than others.

ADD REPLY
0
Entering edit mode

Sorry. The box plots show the distribution of read counts per replicate. The first four are the vehicle treated replicates and the last four are drug treated. The desired read depth was 100M but what we received varied between 65M and 120M per replicate. There doesn't appear to be any correlation between read depth and variation.

ADD REPLY
0
Entering edit mode

Just a side comment: are yo sure that your data is forward-stranded? option --library-type fr-secondstrand. Most popular RNAseq lib. prep. protocols today are reverse-stranded.

ADD REPLY
0
Entering edit mode

I was originally using QoRTs for read counting and it complained that the strandedness might be different than specified so I used Chipster to verify that it is fr-secondstranded. The runs I had done in the past were firststranded, but I think the sequencing facility has upgraded their platform.

ADD REPLY
0
Entering edit mode

What is the y-axis? Please provide some more details.

ADD REPLY
0
Entering edit mode

Y axis is (RLE = log-ratio of read count to median read count across sample). These were generated using RUVseq which is software that uses the ERCC spike-ins to nomalize the data. The data in the figure is unnormalized.

ADD REPLY

Login before adding your answer.

Traffic: 2747 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6