sc-WBGS library GC content, low bismark mapping rate, high CHH%
1
0
Entering edit mode
6.7 years ago
Jingyue ▴ 70

Hi, I have met some problem in analysis sc-WGBS dataset. The single cell WGBS experimental protocol was followed the nature protocol: Smallwood et.al. This is in Bovine, we used the genome UMD3.1.1

This is the fast-QC GC-content before TrimGalore before trim5end

After the first 12bp were trimmed off in 5' end after_trim5end

The C% does not fall to 0, in some samples can be as high as 10%.

Mapping rate summary

mapping_rate_summary

We had a very low mapping rate 1.8% to 22.5%, and some sample has high methylation C in CHH 1.9% to 29.9%, which normally should be 1-2%.

Here is the code we used in Bismark:

bismark \
${genome_folder} \
--multicore 8 \
--non_directional \
--score_min L,0,-0.6 \
-I 0 -X 1000 \
--un \
-1 ${line}_1_val_1.fq.gz \
-2 ${line}_2_val_2.fq.gz

Final Alignment report
======================
Sequence pairs analysed in total:   41705524
Number of paired-end alignments with a unique best hit: 1240792
Mapping efficiency: 3.0%
Sequence pairs with no alignments under any condition:  40379911
Sequence pairs did not map uniquely:    84821
Sequence pairs which were discarded because genomic sequence could not be extracted:    169

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT:   112057  ((converted) top strand)
GA/CT/CT:   122032  (complementary to (converted) top strand)
GA/CT/GA:   907980  (complementary to (converted) bottom strand)
CT/GA/GA:   98554   ((converted) bottom strand)

Final Cytosine Methylation Report
=================================
Total number of C's analysed:   52808430

Total methylated C's in CpG context:    2562392
Total methylated C's in CHG context:    2974558
Total methylated C's in CHH context:    13849485
Total methylated C's in Unknown context:    176


Total unmethylated C's in CpG context:  4258477
Total unmethylated C's in CHG context:  6975731
Total unmethylated C's in CHH context:  22187787
Total unmethylated C's in Unknown context:  1372


C methylated in CpG context:    37.6%
C methylated in CHG context:    29.9%
C methylated in CHH context:    38.4%
C methylated in unknown context (CN or CHN):    11.4%

We don't know if this is because of our library prep problem or something wrong with our data analysis. Any experience about this or good suggestions?

Thank you for your help!

wgbs gc content methylation • 3.4k views
ADD COMMENT
1
Entering edit mode
6.7 years ago

Hi Ellie,

The FastQC plot before trimming looks familiar, I suppose 9bp trimming would have been sufficient but 12bp can't hurt. I suppose you also ran standard adapter and quality trimming as well?

The amount of C is not expected to drop to 0% in this case because non-directional libraries have a fair amount of read alignments to all four strands.

The mapping rate of paired-end libraries constructed following the scBS protocol is fairly poor for because a) the protocol generates a high amount of chimaeric reads (https://sequencing.qcfail.com/articles/pbat-libraries-may-generate-chimaeric-read-pairs/), b) there I believe there were some polyT sequences in the read and c) I think there were some rather high levels of an illumina primer construct in the libraries as well. If you add the following contamination to the FastQC adapter contamination file you can track this contaminant as well.

Illumina Paired End PCR Primer 2 CTACACGACGC

We saw that all these issues together typically resulted in mapping efficiencies of ~10-30% only. Also, because of the chimaeric read problem we tend to align sc reads in single-end mode only, maybe you should try that as well?

Finally, have you checked if the samples contain contaminants, such as human cells, PhiX or other typical sequences you are handling in your lab? Maybe there is some cross-contamination going on? FastQ_screen can be run in ---bisulfte mode to screen several different genomes, or you could just take a small subset of samples and reads and align them to the genomes in question.

I notices that you used fairly relaxed setting for --score_min. The UMD3.3 genome isn't as good as human or mouse, but by allowing quite relaxed alignments you also increase the chances of mis-alignments, and these tend to increase methylation levels in any context dramatically. Maybe L,0,-0.4 would help keep mis-alignments down a bit.

Just as a side note:Increasing the maximum fragment size to 1000bp is unnecessary for most BS-Seq applications because virtually all fragments are smaller than 200bp, so it really only increases the processing time.

Other than that I can't see anything from the processing side that I would have done differently to be honest. Overall you seem to have a relatively high level of CpG and low level of non-CG methylation, as expected, even if with 3-8% it seems already a little on the higher side. The real outliers seem to have the worst overall mapping effiencies, so misalignments would have a bigger say in average levels as the ones you are reporting here. It is probably still worth looking at the your results, try to see if there are regions with a stupidly high coverage and exclude those from further analysis. There is good chance that removing these regions would bring your average methylation levels down somewhat.

I hope this will give you a few pointers, all the best, Felix

ADD COMMENT
0
Entering edit mode

Hi, Felix, Thank you very much for your reply. I tried single-mode to align in Bismark, the mapping rate does not increase too much, only 1-3%. I also tried L,0,-0.4, the mapping rate in one library dropped from 10.6% to 7.6%, but the non-CG methylation only dropped from 7% to 6% still is high. Do you recommend me to use the filter_non_conversion to remove those libraries with high methylation in non-CpG? I think bovine should like human that non-CG context is very low.

ADD REPLY

Login before adding your answer.

Traffic: 2899 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6