Question

Finding the cause for corrupt ChIP-Seq data

0

4.1 years ago

Aspire ▴ 330

I have received data where one (or more) of the ChIP-Seq steps has failed:

There was a large difference between the number of reads generated for IP and for Input samples (3-5 million reads for IP, 20-30 million reads for Input).
The sequencing was paired-end. For the IP samples, R1 had a high adapter content (around 30% of the reads did not pass the filtering stage, mainly due to adapter content). R2 did not have such a high percentage of adapters, but around 30% of its reads did not pass the filtering stage due to poor quality. Input samples did not have this problem.
When aligning, R2 of the IP samples never exceeded 20% unique alignment. (R1 of the IP samples was uniquely aligned at around 50-70%, with the Input samples being aligned at a 75% rate.)
In the end, even for R1, only about 500K to 1 million reads are uniquely aligned per IP sample.

So, basically there is no data to work with.

What I would like to do, however, is to understand whether the failure occurred (1) during the ChIP stage, (2) during transit of the samples (the DNA was in transit for around 10 days), or (3) during library preparation.

Can you suggest how this could be checked, if it can be checked at all?

Preparing the libraries anew / resequencing is currently not possible, unfortunately.

ChIP-Seq • 1.5k views

ADD COMMENT • link updated 4.1 years ago by colindaven 6.4k • written 4.1 years ago by Aspire ▴ 330

1

What are the read lengths of R1 and R2 (it sounds like they were unequal), and what was the average size of the fragments that went into the library prep? Did you align R1 and R2 separately? What about the reads that did align: can you call peaks with them, and do the results look at least somewhat normal in a genome browser (i.e. can you see peaks)? Ten days is long; was the material kept at 4°C or below, or was it at room temperature? Did the libraries look good on a Bioanalyzer after library prep?

ADD REPLY • link 4.1 years ago by ATpoint 82k

0

It seems that he not only aligned R1 and R2 separately, but also did the quality and adapter trimming separately for R1 and R2.

ADD REPLY • link 4.1 years ago by h.mon 35k

0

Yes. TruSeq adapters were used, and these are different for R1,R2.

ADD REPLY • link 4.1 years ago by Aspire ▴ 330

0

The read length of both R1 and R2 is 43. I have aligned R1 and R2 separately (when I tried paired-end alignment, the percentage of unaligned reads was slightly larger than the sum of the percentages of unaligned reads for R1 and R2 separately).

I have called peaks for some samples; from what I see so far, the maximum pileup for a peak is about 9 reads, with about 100 peaks. So FRiP is hardly existent :)
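For readers unfamiliar with the metric: FRiP is the fraction of mapped reads that fall inside called peaks. The snippet below is only a minimal sketch of that ratio, assuming each read is reduced to a single position and that peaks are non-overlapping half-open intervals; the function name and input layout are made up for illustration and are not taken from any peak caller.

```python
from bisect import bisect_right


def frip(read_positions, peaks):
    """Fraction of reads whose position lies inside any peak.

    read_positions: iterable of (chrom, pos), one entry per mapped read
                    (hypothetical input layout).
    peaks: iterable of (chrom, start, end), half-open [start, end),
           assumed not to overlap each other.
    """
    # Group peak intervals by chromosome and sort them by start.
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {c: [s for s, _ in iv] for c, iv in by_chrom.items()}

    total = 0
    in_peaks = 0
    for chrom, pos in read_positions:
        total += 1
        intervals = by_chrom.get(chrom)
        if not intervals:
            continue
        # Last peak that starts at or before this position, if any.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            in_peaks += 1
    return in_peaks / total if total else 0.0
```

With only about 100 peaks and a maximum pileup of about 9 reads, the numerator of this ratio would be tiny compared with the number of mapped reads, which is what "hardly existent" refers to.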

It is unclear, and now impossible to find out, whether the samples were actually stored in a cool environment.

I do not yet understand the TapeStation graphs myself, but I was now told by the sequencing center that the libraries had sequences that were "too long" (need to find out how long exactly). Will update when this is more clear.

ADD REPLY • link 4.1 years ago by Aspire ▴ 330

0

I mostly work with single-end ChIP-seq, so I can't say much about some of your observations. For the Input getting a lot more reads than the IP, did you pool them on a single lane or sequence them separately? If pooled, did you just combine equal volumes or did you measure the concentrations and pool based on that?

High rates of duplication can indicate too many PCR cycles. If you needed to do a lot of cycles because of low concentration from IP, then you may need to optimize your shearing or ChIP stage, or just use a larger number of cells/more tissue from the start.

ADD REPLY • link 4.1 years ago by colin.kern ★ 1.1k

0

The samples were multiplexed and sequenced on the same lane. I do know that the concentrations were measured and that the samples were pooled based on them. This is why the difference between IP and Input is surprising. Thanks.

ADD REPLY • link 4.1 years ago by Aspire ▴ 330

score 1 · Answer 1 · 2020-03-22

1

4.1 years ago

ATpoint 82k

TruSeq adapters were used, and these are different for R1,R2.

Actually they are the same for both R1 and R2. The sequence to trim is AGATCGGAAGAGC. If you trim the data, use a trimmer in paired-end mode. Choices include cutadapt, bbduk.sh, skewer and fastp; please do not use fastx-toolkit. Also, you should always align paired-end data together in paired-end mode. All aligners support that. It is actually impossible that R1 and R2 have different amounts of adapter contaminations if the reads are equally long. Did you run fastqc? Please do not do a custom alignment such as aligning R1 and R2 separately and then merging the results later on; that is meaningless. I suggest you redo the analysis up to this point, trimming and aligning in PE mode, and then repeat the peak calling.
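As a rough illustration of what "trimming and aligning in PE mode" could look like, the sketch below only builds the command lines and prints them; it does not run anything. The choice of cutadapt and bwa mem, the file names, the reference name and the minimum length of 20 are assumptions made for this example, not settings taken from the thread.

```python
import shlex

# Shared start of the TruSeq adapters, as given in the answer above.
ADAPTER = "AGATCGGAAGAGC"


def trim_command(r1, r2, out1, out2, min_len=20):
    """Build a cutadapt call that trims both mates in one paired-end run.

    min_len=20 is an arbitrary illustrative cutoff.
    """
    return [
        "cutadapt",
        "-a", ADAPTER,        # 3' adapter to remove from R1
        "-A", ADAPTER,        # 3' adapter to remove from R2
        "-m", str(min_len),   # minimum read length after trimming
        "-o", out1,           # trimmed R1 output
        "-p", out2,           # trimmed R2 output
        r1, r2,
    ]


def align_command(reference, r1, r2):
    """Build a bwa mem call in paired-end mode (SAM is written to stdout)."""
    return ["bwa", "mem", reference, r1, r2]


if __name__ == "__main__":
    # Hypothetical file names, for display only.
    print(shlex.join(trim_command("ip_R1.fastq.gz", "ip_R2.fastq.gz",
                                  "ip_R1.trim.fastq.gz", "ip_R2.trim.fastq.gz")))
    print(shlex.join(align_command("genome.fa",
                                   "ip_R1.trim.fastq.gz", "ip_R2.trim.fastq.gz")))
```

Trimming both files in a single call keeps the mates in sync, so there is no need to re-pair reads afterwards.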

The different read numbers for the conditions are not necessarily a problem with the libraries; this could also be poor quantification/pooling prior to sequencing. Did you pool the final libraries yourself, or did the sequencing facility do it based on the individual libraries? Can you show the TapeStation trace and the fastqc result for the adapter contamination?

ADD COMMENT • link 4.1 years ago by ATpoint 82k

0

Is trimming AGATCGGAAGAGCACACGTCTGAACTCCAGTCA for R1, and AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT for R2 (as instructed by the Illumina manual) wrong?
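For reference, the two sequences quoted above can be compared directly. This quick check uses only the sequences from this post and shows that they start with the same 13 bases, which is exactly the short sequence suggested in the answer:

```python
from os.path import commonprefix

# The two adapter sequences as quoted in this post.
R1_ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
R2_ADAPTER = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# commonprefix works character by character on any list of strings.
shared = commonprefix([R1_ADAPTER, R2_ADAPTER])
print(shared, len(shared))  # AGATCGGAAGAGC 13
```

So the longer per-read sequences and the short shared sequence describe the same adapter start; the two sequences only diverge after the first 13 bases.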

Even though the trimming was done separately (with cutadapt), the resulting reads (matched anew to each other after the trimming) were aligned in paired-end mode.

It is actually impossible that R1 and R2 have different amounts of adapter contaminations if the reads are equally long.

In my case, it could already be seen in BaseSpace that R1 and R2 had different amounts of adapter contamination. Why would that be impossible?

ADD REPLY • link 4.1 years ago by Aspire ▴ 330

0

In my understanding, if you have, let's say (simplified), a fragment of 100bp (so only the actual DNA you are interested in) and you sequence, for example, 150bp in both R1 and R2, then you pick up adapter content in both reads from cycle 101 onwards. So if the read length is the same, you should see equal adapter content in both reads. Still, you typically sonicate to get fragments of around 150-300bp for ChIP, so I am actually surprised you even have adapter contamination at 2x40bp. It seems to me this entire library is pretty messed up. I would really need to see the TapeStation pictures to judge whether they agree with the sequencing results. The adapters (all adapter base pairs taken together) are around 100bp, so if you have notable adapter content, the TapeStation result should show an average size below 200bp or so. Difficult to tell remotely. ChIP is a pain, I went through that experience very recently :)
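The read-through argument above can be written as simple arithmetic. This is an idealised sketch that ignores sequencing errors and anything beyond the end of the adapter; the function name and the example insert sizes are chosen only for illustration.

```python
def adapter_bases(insert_len, read_len):
    """Adapter bases expected at the 3' end of one read (idealised).

    A read only runs into the adapter when it is longer than the insert.
    The same formula applies to R1 and R2, which is why equal read
    lengths should give equal adapter content in this simple model.
    """
    return max(0, read_len - insert_len)


if __name__ == "__main__":
    read_len = 43  # read length reported in this thread
    for insert_len in (30, 43, 100, 200):
        r1 = adapter_bases(insert_len, read_len)
        r2 = adapter_bases(insert_len, read_len)
        print(f"insert={insert_len} R1_adapter={r1} R2_adapter={r2}")
```

Under this model, 43bp reads only contain adapter when the insert is shorter than 43bp, far below the 150-300bp fragments usually aimed for in ChIP.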

ADD REPLY • link 4.1 years ago by ATpoint 82k

0

Hi, just to update: the samples were resequenced, and the results were the same. R1 shows adapter content, while R2 shows a lack of signal followed by polyT. The bottom line is that the libraries have clearly failed.

ADD REPLY • link 3.9 years ago by Aspire ▴ 330

score 1 · Answer 2 · 2020-03-23

For really messed-up libraries, you can also run QC without making any assumptions, as a test to get the alignment percentages.

Sometimes I just use bwa-mem to align both mates without adapter trimming or anything else.

Then use samtools flagstat and samtools stats followed by multiqc to try to work out what's going on.
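To turn the flagstat report into a few headline numbers, something like the sketch below could be used. It assumes the default plain-text output of samtools flagstat, where each line has the form "N + M category"; the function names are hypothetical, and the percentages are computed against the "in total" line, which may differ slightly from the percentages samtools prints itself.

```python
import re

# Each flagstat line looks like: "<QC-passed> + <QC-failed> <category ...>"
LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def parse_flagstat(text):
    """Map each flagstat category to its QC-passed read count."""
    counts = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        # Drop trailing notes such as "(90.00% : N/A)" but keep "(mapQ>=5)",
        # which distinguishes two otherwise identical category names.
        label = re.sub(r"\s*\((?!mapQ).*\)$", "", match.group(3))
        counts[label] = int(match.group(1))
    return counts


def percent(counts, label):
    """Percentage of all reads (the "in total" line) in a given category."""
    total = counts.get("in total", 0)
    return 100.0 * counts.get(label, 0) / total if total else 0.0
```

For example, `percent(parse_flagstat(report), "properly paired")` on the untrimmed paired-end alignment gives a quick feel for how many read pairs are usable at all, before any trimming choices are involved.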