Question

sc-WBGS library GC content, low bismark mapping rate, high CHH%

0

Entering edit mode

6.7 years ago

Jingyue ▴ 70

Hi, I have met some problem in analysis sc-WGBS dataset. The single cell WGBS experimental protocol was followed the nature protocol: Smallwood et.al. This is in Bovine, we used the genome UMD3.1.1

This is the fast-QC GC-content before TrimGalore

After the first 12bp were trimmed off in 5' end

The C% does not fall to 0, in some samples can be as high as 10%.

Mapping rate summary

We had a very low mapping rate 1.8% to 22.5%, and some sample has high methylation C in CHH 1.9% to 29.9%, which normally should be 1-2%.

Here is the code we used in Bismark:

bismark \
${genome_folder} \
--multicore 8 \
--non_directional \
--score_min L,0,-0.6 \
-I 0 -X 1000 \
--un \
-1 ${line}_1_val_1.fq.gz \
-2 ${line}_2_val_2.fq.gz

Final Alignment report
======================
Sequence pairs analysed in total:   41705524
Number of paired-end alignments with a unique best hit: 1240792
Mapping efficiency: 3.0%
Sequence pairs with no alignments under any condition:  40379911
Sequence pairs did not map uniquely:    84821
Sequence pairs which were discarded because genomic sequence could not be extracted:    169

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT:   112057  ((converted) top strand)
GA/CT/CT:   122032  (complementary to (converted) top strand)
GA/CT/GA:   907980  (complementary to (converted) bottom strand)
CT/GA/GA:   98554   ((converted) bottom strand)

Final Cytosine Methylation Report
=================================
Total number of C's analysed:   52808430

Total methylated C's in CpG context:    2562392
Total methylated C's in CHG context:    2974558
Total methylated C's in CHH context:    13849485
Total methylated C's in Unknown context:    176


Total unmethylated C's in CpG context:  4258477
Total unmethylated C's in CHG context:  6975731
Total unmethylated C's in CHH context:  22187787
Total unmethylated C's in Unknown context:  1372


C methylated in CpG context:    37.6%
C methylated in CHG context:    29.9%
C methylated in CHH context:    38.4%
C methylated in unknown context (CN or CHN):    11.4%

We don't know if this is because of our library prep problem or something wrong with our data analysis. Any experience about this or good suggestions?

Thank you for your help!

wgbs gc content methylation • 3.4k views

ADD COMMENT • link updated 6.7 years ago by Felix Krueger ▴ 40 • written 6.7 years ago by Jingyue ▴ 70

score 1 · Accepted Answer · 2017-08-17

Hi Ellie,

The FastQC plot before trimming looks familiar, I suppose 9bp trimming would have been sufficient but 12bp can't hurt. I suppose you also ran standard adapter and quality trimming as well?

The amount of C is not expected to drop to 0% in this case because non-directional libraries have a fair amount of read alignments to all four strands.

The mapping rate of paired-end libraries constructed following the scBS protocol is fairly poor for because a) the protocol generates a high amount of chimaeric reads (https://sequencing.qcfail.com/articles/pbat-libraries-may-generate-chimaeric-read-pairs/), b) there I believe there were some polyT sequences in the read and c) I think there were some rather high levels of an illumina primer construct in the libraries as well. If you add the following contamination to the FastQC adapter contamination file you can track this contaminant as well.

Illumina Paired End PCR Primer 2 CTACACGACGC

We saw that all these issues together typically resulted in mapping efficiencies of ~10-30% only. Also, because of the chimaeric read problem we tend to align sc reads in single-end mode only, maybe you should try that as well?

Finally, have you checked if the samples contain contaminants, such as human cells, PhiX or other typical sequences you are handling in your lab? Maybe there is some cross-contamination going on? FastQ_screen can be run in ---bisulfte mode to screen several different genomes, or you could just take a small subset of samples and reads and align them to the genomes in question.

I notices that you used fairly relaxed setting for --score_min. The UMD3.3 genome isn't as good as human or mouse, but by allowing quite relaxed alignments you also increase the chances of mis-alignments, and these tend to increase methylation levels in any context dramatically. Maybe L,0,-0.4 would help keep mis-alignments down a bit.

Just as a side note:Increasing the maximum fragment size to 1000bp is unnecessary for most BS-Seq applications because virtually all fragments are smaller than 200bp, so it really only increases the processing time.

Other than that I can't see anything from the processing side that I would have done differently to be honest. Overall you seem to have a relatively high level of CpG and low level of non-CG methylation, as expected, even if with 3-8% it seems already a little on the higher side. The real outliers seem to have the worst overall mapping effiencies, so misalignments would have a bigger say in average levels as the ones you are reporting here. It is probably still worth looking at the your results, try to see if there are regions with a stupidly high coverage and exclude those from further analysis. There is good chance that removing these regions would bring your average methylation levels down somewhat.

I hope this will give you a few pointers, all the best, Felix