ChIP-Seq samples : FASTQ quality control
1
0
Entering edit mode
6.1 years ago
erwan.scaon ▴ 930

I stumbled across something I never saw before with raw FASTQ files during QC. We are working with ChIP-Seq samples (Mutant (MAR) vs WT, 3 replicates each, 1 IP and 1 Input (IN) for each replicates for a total of 12 samples).

A pattern is emerging while looking at mean quality scores per position in the raw reads : First 35 bp are somewhat low-quality, then there is a sudden increase in quality until the end of the read (75 bp), see this plot : raw_mean_qual_scores_multiqc Here is what I know about the samples :
Samples were prepared using "TruSeq® ChIP Library Preparation Kit" (with 1% PhiX Control) and sequenced with Illumina NextSeq (single end 75 bp).
During samples preparation, after several steps, a point was reached with ~ 10M cells in 350uL. Then, 50uL were used for Input (~ 1.4M cells) & 2x150uL (~ 2x4.3M cells) for IP (one IP directed against the protein of interest and one non-specific IP for background noise).

Any suggestions for processing those reads ? Should I trim the first 35 bp regardless of the sample ?

Edit

I allow myself to edit the OP post to take into account all the good suggestions made in the replies. Especially regarding @Devon post to take depth bias into account.

Another thing to know about these samples is the very uneven number of raw reads between them (thus the importance to account for depth bias, which was not done previously) : general_stats_multiqc

The 5 samples which seem to have the best raw_mean_qual_scores_plot are WT_IP_3, WT_IP_2, MAR_IN_2, WT_IN_3 and MAR_IN_3. They also are the samples with the highest number of raw reads.

These 5 samples happen to be the one considered as potential outliers on PCA plots generated from read count coming from un-normalized BAMs.

New PCA

Below are the relevant options used in cmds leading to a PCA plot on read counts from normalized BIGWIGs (mm10 normalizeTo1x) using deepTools 3.0.1 (using raw reads, no trimming nor duplicates removal) :

bamCoverage --binSize 10 \
            --effectiveGenomeSize 2150570000 \
            --normalizeUsing RPGC \
            --ignoreForNormalization chrX;

multiBigwigSummary bins \
                   --chromosomesToSkip chrX;

Nb : I tried multiBigwigSummary with --binSize 10 but it was taking way too long

plotPCA --transpose;

transposed_normalized_PCA

2nd PCA plot without the 2 potential outliers : transposed_normalized_PCA_bis

ChIP-Seq QC • 4.0k views
ADD COMMENT
2
Entering edit mode

Should I trim the first 35 bp regardless of the sample ?

Not off the bat. Have you tried to align the data as is? My guess is that should be good align-able data. One reason Q scores suffer is when basecaller perceives low-diversity sequence appearing in many clusters. In ChIPseq one would expect to see a specific sequence getting enriched but to have that as the most predominant sequence in all samples in first 35 cycles (to the extent that it caused the Q scores to drop) seems a bit far fetched.

ADD REPLY
0
Entering edit mode

Yes data was mapped as is vs reference genome : multiqc.

Thing is half my samples do not reach 50% uniquely mapped reads.

star multiqc

ADD REPLY
0
Entering edit mode

There may be an experimental explanation for what you are observing. Have all samples been QC'ed/treated the same way?

ADD REPLY
0
Entering edit mode

Yes all samples have been treated the same way on our side.

I asked if samples were on differents Illumina runs to those concerned, answer is no.

ADD REPLY
1
Entering edit mode
6.1 years ago

Since you're using STAR anyway, you can use the soft-clipping functionality from it to avoid needing to trim. I would, however, adjust --outFilterNminOverLread to a lower value, like 0.5 (this will allow STAR to output alignments where the read is more heavily soft-clipped). Since the bias in the signal should be random, that shouldn't cause excessive bias. My hope is that this will start to make the alignment metrics more comparable across samples, regardless of how crappy the first 35 bases are. Worst case scenario you can lop off the first 35 bases or so from all of the samples, but that should be avoided if possible, especially if some of you're IPing repressive marks (I assume not, given the MultiQC output, but figured I'd mention it).

Once you have more similar alignment metrics, it'd be interesting to see fingerprint plots (e.g., with plotFingerprint from deepTools) for these samples. I worry that the sequencing QC issues you're seeing are a manifestation of some weird underlying library prep issue. Looking at fingerprints (run multiBamSummary and then plotCorrelation from deepTools too) might help elucidate whether this is the case, or whether you can safely proceed in with the analysis.

Now you're probably wondering why you're seeing this weird quality step-change in some samples but not others. I have simply no good explanation for that. My presumption is that these samples were actually on different runs, but even that's a bad explanation (e.g., the step-change in quality wouldn't be coordinated then).

ADD COMMENT
1
Entering edit mode

Would be interesting to see a PCA plot of these samples. There is some pattern in there and it may become apparent then.

ADD REPLY
1
Entering edit mode

Agreed, though at least from eye balling things it seems that the STAR output correlates with the FastQC metrics.

@Erwan.scaon: You can use plotPCA (again, from deepTools) on the MultiBamSummary results. Please use the --transpose option with plotPCA if you do that.

ADD REPLY
0
Entering edit mode

MultiBamSummary is running.
I don't find the --transpose option in plotPCA 2.5.7. I see it's mentionned here : github, but how do I use it ?

Nb : This plot will be done on initial STAR runs (without --outFilterMatchNminOverLread set to 0.5 instead of 0.66), I hope it's fine.

Nb : plotCorrelation & plotFingerprint incoming too (after new STAR run with --outFilterMatchNminOverLread set to 0.5).

ADD REPLY
0
Entering edit mode

--transpose is in 3.0, which is what you should get if you just pip install or conda install things.

ADD REPLY
0
Entering edit mode

On initial STAR runs :
Using multiBamSummary 2.5.7 (without --ignoreDuplicates) :

multiBamSummary bins \
                -b *.bam \
                -o coverage_matrix.npz \
                -p max;

After cloning latest deeptools :

git clone https://github.com/deeptools/deepTools --branch develop

Then using /deeptools/plotPCA.py (because /bin/plotPCA -h doesn't show --transpose option) :
Using plotPCA.py 3.0.1-22-ca45a47 :

sudo python plotPCA.py --transpose \
                       -in ~/Desktop/coverage_matrix.npz \
                       -o ~/Desktop/PCA_readCounts.png \
                       -T "PCA of read counts" \
                       --colors grey black grey grey grey grey red red red red blue red

The result (seems pretty ugly) :
plotPCA

Nb : I had an error if not using the --colors option with plotPCA :

Traceback (most recent call last): File "plotPCA.py", line 189, in <module>
main()
File "plotPCA.py", line 176, in main
cols=args.colors)
File "/usr/local/lib/python2.7/dist-packages/deeptools/correlation.py",line 653, in plot_pca
color = pltcolors.to_hex(color, keep_alpha=True)
AttributeError: 'module' object has no attribute 'to_hex'

ADD REPLY
0
Entering edit mode

I'll have a look at the to_hex() issue, not sure what that's about. You definitely have two samples that are obvious outliers given this.

ADD REPLY
0
Entering edit mode

Considering MAR_IN_2 and WT_IP_2 as outliers given the above PCA_plot, I did ran multiBamSummary and plotPCA again without them :

pca

One striking thing is that this new PCA seems to highlight 3 additionals outliers : MAR_IN_3, WT_IN_3 and WT_IP_3.
Those samples, like MAR_IN_2 and WT_IP_2 are amongst the best in the Q_score_plot, as pointed out by @genomax.

Indeed, the top 5 samples on the Q_score_plot are :
WT_IP_3
WT_IP_2
MAR_IN_2
WT_IN_3
MAR_IN_3

ADD REPLY
1
Entering edit mode

The dark triangle in the PCA bi-plot looks like an outlier. I suggest to remove it.

ADD REPLY
0
Entering edit mode

STAR alignement with --outFilterNminOverLread 0.5 did not affects things apparently (no visible impact on the STAR MultiQC plot)

star_multiqc_outFilterNminOverLread

ADD REPLY
0
Entering edit mode

Are MAR_IN_2 and WT_IP_2 the outliers (hard to tell from the pic above). They seem to have very similar profiles in alignments.

ADD REPLY
0
Entering edit mode

I edited the plot with new colors set to make things clear.

Yes that was MAR_IN_2 and WT_IP_2

ADD REPLY
1
Entering edit mode

Interestingly they don't seem to have the strange Q score issue if I am reading the plot in the original post. Not sure what is going on with MAR_IP_2 in alignment plot above. That looks like an outlier there.

ADD REPLY
0
Entering edit mode

Right, they are amongst the best samples regarding this metric on the Q_score_plot.

It puzzles me that they end up being seen as outliers.

Regarding MAR_IP_2 in the alignment_plot, not even the STAR run with --outFilterMatchNminOverLread 0.5 could improve it. It's the sample with the worst QC report, I guess it's fair that it has the worst alignement stats too.

ADD REPLY
1
Entering edit mode

If the crappy quality ones all cluster together then the good ones will appear to be outliers. I also forgot about the fact that there's going to be a depth bias with this. I usually use MultiBamSummary with bigWig files for things like that, but forgot that step. It could be interesting to do that (you can use bamCoverage to make bigWig files from your BAM files) and see if that clarifies things.

ADD REPLY
0
Entering edit mode

I edited the OP post with all relevant infos & a new PCA based on normalized BIGWIGs.

ADD REPLY
1
Entering edit mode

If you end up having to take those two samples out does that break your experimental design. Things will no longer be balanced. That GC% is weird too. MAR_IP_2 is special.

ADD REPLY
0
Entering edit mode

Removing MAR_replicate_2 & WT_replicate_2 because of MAR_IP_2 and WT_IN_2 would leave me with 2 mutant and 2 WT replicates, which may be fine.

ADD REPLY
1
Entering edit mode

That's probably worth a go then. Given the PCA with those removed I'm not holding my breath, but perhaps this is picking up something like blacklisted regions (though there aren't terribly many blacklisted regions in GRCm38).

ADD REPLY
1
Entering edit mode

A) To check if blacklisted regions have an impact, I did remove them using :
Bedtools2.27.1
Blacklist

bedtools intersect -v -a file.bam -b blacklist.bed

It did not significantly impact the PCA (despite removing a non-negligible amount of alignments).

B) I then tried to remove duplicates on the blacklisted filtered BAM using :

samtools rmdup

It had an impact on PCA, see below :
(Raw reads aligned with STAR vs GRCm38.p6, blacklist filtered, dups removed, BIGWIG normalized to 1x (RPGC), multiBigwigSummary bins, plotPCA --transpose)

PCA with 12 samples :

pca

11 samples (removed what seem to be an outlier : WT_IN_2)

pca

10 samples (removed WT_IP_1)

pca

C) Associated fingerprints (as suggested by @Devon Ryan)
plotFingerprint 3.0.1

Fingerprint for :

All fingerprint

MAR_1 fingerprint

MAR_2 fingerprint

MAR_3 fingerprint

WT_1 fingerprint

WT_2 fingerprint

WT_3 fingerprint

This is very different from what we see in the doc. There is a large proportion of the genome that is not sequenced. IP and IN are often hard to distinguish. It is expected that the IP only target few regions on the genome (IP is targeted vs AID enzyme)

The choices to make are still pretty unclear to me. I'll stop bothering you guys with this dataset. I'll try my best. Thanks for all your help, advice & suggestions

ADD REPLY
0
Entering edit mode

The input looks terrible here, either it's vastly under sequenced or there's some other sort of bias to it.

ADD REPLY
0
Entering edit mode

Yes I guess I'll try MACS2 + Diffbind with 2 replicates for MAR & WT (despite 2 WT samples looking like outliers on the 2nd PCA).

The issue may be that I did not filter my data ? Maybe with duplicates & blacklisted regions removed, thing would be better ? I did not bother with this because :
1) I saw that MACS2 auto filters tags, keeping only like 5-10% of them before calling peaks, which <=> dups removal.
2) Thought that poor quality reads would not map with STAR anyway (used STAR because the Biostars handbook advise to use a gapped aligner for ChIP-Seq).
3) I have no good reason for not removing blacklisted regions.

ADD REPLY
0
Entering edit mode

I'm surprised the handbook says to use STAR, one usually uses an ungapped aligner for ChIPseq data since it can't be spliced. I mean, you can make STAR do that too and its faster than bowtie2/bwa, but that's nonetheless odd. It's worth a try to filter and then recheck this, but I doubt the difference will be stark. Have you run plotFingerprint on these? If that's vastly different (except between input and ChIP) then I wouldn't even bother with peak calling. Note that you may need 3 samples per group for diffbind to work, it's quite picky.

ADD REPLY
0
Entering edit mode

In the "chip-seq-intro" of the biostarhandbook, under the asked question : "Should my ChIP-Seq aligner be able to detect INDELs?", we can read : "As a rule, aligners that cannot detect insertions and deletions should NEVER be used under ANY circumstances - unless no other aligner can be utilized (very rare cases where other aligners would not work at all)."

This is my bad, the biostarhandbook is not telling us to use STAR or a splice-aware aligner, just one that can handle InDel.

I'll try to revisit my alignments with another tool if needed.

I did not run plotFingerprint yet.

Diffbind is OK with only 2 replicates when using dba.contrast(aid, categories=DBA_CONDITION, minMembers=2)

ADD REPLY

Login before adding your answer.

Traffic: 2655 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6