Using featureCounts and DESeq2 to look at differences between ATAC-seq conditions.
1
5
Entering edit mode
5.5 years ago
a.rex ▴ 350

Hello,

I am new to ATAC-analysis, and wish to ask for some advice in identifying significant differences between two ATAC-seq samples.

Firstly, I have two conditions - control and knockdown. I have mapped the PE reads to a reference genome, and then subsequently taken reads lower than 100bp (i.e. open chromatin regions). I then use MACS2 to turn my two un-normalised BAM files to peaks using this command:

macs2 callpeak -t macs2 callpeak -t bamfile --outdir /path/to/ -f BAMPE --keep-dup all --pvalue 1e-2 --call-summits --bdg

I then use feature counts to quantify mapped sequencing reads my 'peak' genomic features.

awk 'OFS="\t" {print $1"."$2+1"."$3, $1, $2+1, $3, "."}' peaks.narrowPeak > contol.peaks.saf 

featureCounts -a peaks.saf -F SAF -T <int> -p -o control.peaks_countMatrix.txt control.lessthan100bp.bam

So I result in two matrix outputs from feature counts: control.peaks_countMatrix.txt; and knockdown.peaks_countMatrix.txt

I am not used to using DESeq2 for differential analysis (have been using Kallisto/Sleuth for RNA-seq). Am I right in thinking that the two matrix files can be inputted to DESeq2, to peform differential analysis against? Does anyone have a pipeline that they could share in getting this done?

ATAC-seq next-gen R • 8.4k views
ADD COMMENT
5
Entering edit mode
5.5 years ago
ATpoint 81k

I would call peaks with the command:

macs2 callpeak -t bamfile --nomodel --keep-dup=all

The --nomodel turns off the ChIPseq-specific shifting model. Do not use the p-value argument, as correction for multiple testing prevents you from excessive numbers of false-positives. This excessive number of peaks will contain lots of regions with small counts (because of the relaxed cutoff) which lack the statistical power to call differential accessability. Better use the default q-value cutoff of 0.05 or even 0.01. As you already have the sub-100bp reads, I think the --call-summits option is neither necessary, nor beneficial. Most peaks will already be around 100-200bp.

There are multiple options how to call the peaks. Either call them one the merged BAM files of one condition, or call on every replicate and then take the intersect for each condition. There is no standard for this. I would probably call peaks with the above command and a q-value of 0.01 on each single file (I hope you have replicates, if not there is no point in using DESeq2), and then take the intersect between the files of one condition. You'll end up with one peak set for each condition, which is then to be merged (bedtools merge) and used as input for featureCounts to count the reads. DESeq2 takes one count matrix which contains the counts for each single replicate over one peak set (the merge between the conditions). From there on, you can pretty much follow the DESeq2 manual. It contains all the necessary code and is outstandingly extensive and informative.

ADD COMMENT
0
Entering edit mode

Thank you for this answer - how would you intersect the replicate narrow peak files? I then merge the two consensus peak files (corresponding to each condition) with bedtools merge?

ADD REPLY
1
Entering edit mode

Intersection can be done with bedtools intersect, and then merge the output with bedtools merge.

For the merge:

cat peakset1.bed peakset2.bed | sort -k1,1 -k2,2n | bedtools merge -i - > merged.bed
ADD REPLY
0
Entering edit mode

And, does macs2 need to be given a BAMPE option if I did PE sequencing?

ADD REPLY
1
Entering edit mode

Oh yeah, sorry I forgot that in my command. I always use -f BAMPE to make use of the paired-end data.

ADD REPLY
0
Entering edit mode

This gives me a bed file with consensus intervals that is all. You cannot differentiate between origin of peaks as well as no scores are given..

ADD REPLY
0
Entering edit mode

Yes. That is normal and intended. This file will serve as reference for featureCounts. The output DESeq2 will give you then are regions that are either more or less accessable. Only the read counts per region matter, not the origin of the peak or its macs score. This is a completely normal and standard way of doing differential analysis in this context. Why do you think you need more information?

ADD REPLY
0
Entering edit mode

I need to know what is more of less accessible between my sample versus control? So in featurecounts do I first run: featureCounts -a peaks.bed -F bed -T <int> -p -o control.peaks_countMatrix.txt control.lessthan100bp.bam; and then my featureCounts -a peaks.bed -F bed -T <int> -p -o sample.peaks_countMatrix.txt sample.lessthan100bp.bam. Then merge the matrix, and then perform DESeq2? I apologise if I am missing something....

ADD REPLY
5
Entering edit mode

It is much simpler. Look: In the merged peak file, all the peaks that are present in any sample are included. featureCounts is then used to count the read over all these regions for all samples and all conditions. If a region is closed it will get a low read count, if it is open, the count will be higher. What DESeq2 then does is to check if there are regions that are statistically significant (more or less accessable) between your conditions. Lets say you have three replicates per condition and for a specific peak the counts:

Control (3-5-2) and Condition-1 (243-223-230), then this would most likely come out as a region that is significantly more open/accessable in Condition-1 compared with Control. This is basically the same as you do in RNA-seq analysis. In this, you count the reads over all the genes across your conditions (but the reference, so the genes) are the same. The only difference with ATAC-seq is that you count over all the peaks across all conditions.

ADD REPLY
0
Entering edit mode

Okay I understand. Unfortunately, however, when I try cat peakset1.bed peakset2.bed | sort -k1,1 -k2,2n | bedtools merge -i > merged.bed, the merged.bed cannot be used as input to feature counts. I tried also converting to GTF.

ADD REPLY
0
Entering edit mode

There is a - missing after -i to indicate that the stream comes from stdin. You'll also need to convert to the SAF format (see featureCounts website for details).

ADD REPLY
0
Entering edit mode

Would you, in theory, given these set of consensus intervals, be able to use any count, normalization and difference methodology. I.e. kallisto and sleuth

ADD REPLY
0
Entering edit mode

Puh, I am not a statistician. That having said, kallisto-sleuth probably not because the tools (as far as i know) depend on each other, and kallisto's pseudoalignment is tailored for RNA-seq (experts on sleuth may correct me if the statement is wrong). I've seen publications use DESeq2 and edgeR for ATAC-seq plus a lot of FPKM followed by arbitrary fold-change threshold approaches. I would always stick with established tools whenever possible, unless you have expert knowledge that enables you to do things out-of-the-box. Do you have a specific approach in mind? RIght now, there is to my knowledge no specific ATAC-seq differential accessability tool that is frequently used. Therefore, the data are pretty much treated like ChIP-seq data, and for this DESeq2 is definitely an option. Tools like DiffBind for ChIP-seq use DESeq2 internally. But as you'll see when you google around, there are plenty of methodologies for differential analysis, all with their own pros and cons. Not being a statistician myself, I'll stick with established tools that are widely accepted and maintained by the developers plus have a good documentation.

ADD REPLY

Login before adding your answer.

Traffic: 1551 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6