Question

Tutorial:ClusterProfiler A software for functional enrichment of differentially expressed genes- A tutorial

9

Entering edit mode

3.1 years ago

Novogene ▴ 420

Hi, everyone,

Recently, I've learned the software clusterProfiler and made some relevant analysis and summary. Now, I'm going to share it with you.

As is known to all, DNA and RNA sequencing have gradually become widely used techniques in molecular biology research. Among them, RNA sequencing (RNA-seq) especially anchors itself on the basis of determining phenotypic variation. The phenotypic variation ranges from the population level under certain biological characteristics/conditions to the disease pathology of a person or group of people.

However, before starting an RNA-seq project, there are some fundamental questions helping to reveal and understand the molecular mechanisms of gene expression in various biological processes:

How to analyze the huge and complicated sequencing data into meaningful and obvious results?
How to select genes related to the phenotype of related organisms and relevant biological phenotypes, and find the biological pathways of significant differentiation?
How to quantify the expressed genes and further analyze the important candidate genes?

The analysis of RNA sequencing data usually starts with the identification of genes that are responsive to specific conditions or that have undergone experimental treatment, and show significant phenotypic variation between biological groups. Then, the sequencing reads are quantified and the process of variation in the transcriptome profile is estimated. There are several online and desktop tools (R and python packages) for differential expression analysis of aligned RNA-seq data. In the next step, the transcripts will be further refined and annotated through gene families and functions. The main purpose of this is to determine a group of genes belonging to a certain functional group, which will help us prove the phenotypic differences between the samples or groups. The tools for functional analysis of differentially expressed genes involve a variety of methods, and can be divided into three categories: ORA (Over-Representation Analysis), FCS (Functional Class Scoring) and PT (Pathology Topology).

ORA is the discovery of an over-represented gene or protein in a set of genes or proteins, which is currently the most common enrichment analysis. The most commonly used statistical tests are the hypergeometric test and the chi-square test. However, the shortcomings are also obvious, such as ignoring genes that are not significantly different and ignoring the mutual influence of genes. Considering the attribute information of gene expression value, compared with ORA, FCS has an obvious breakthrough in theory. The representative of FCS method is Gene Set Enrichment Analysis (GSEA). The principles of PT and FCS are similar, but the path topology method is used in the statistical test of genetic level, which is tightly dependent on the accuracy of accessing the database and is not very mature yet. There are many softwares for gene enrichment analysis. But today let's talk about one of the most common used, clusterProfiler.

What is "clusterProfiler"?

"clusterProfiler" is an R package that provides statistical tests for expression analysis of terms such as GO (Gene Ontology), which are related to gene lists that have shown statistically significant differences. When using this tool for GO analysis, only two inputs are required: a list of genes that show significant differentiation in expression and a list of background genes. Hypergeometric tests are used for statistical enrichment analysis test. The basic point of the package allows to select GO terms measuring BP (biological process), CC (cell component), MF (molecular function) to test.

The package supports the following types of analysis:

Over-Representation Analysis: This is a statistical method used to determine whether there are more genes in a pre-defined set of genes (for example, genes belonging to a specific GO term or KEGG pathway) than expected (over-represented) genes in a subset of your data.
Gene Set Enrichment Analysis: GSEA, also known as functional enrichment analysis, is a method used to identify genes or proteins that are over-represented in a large number, and there may be an associated relationship to disease phenotypes. GSEA performs as a the powerful analytical method for interpreting gene expression data. This method obtains its functions by focusing on gene sets (for example, genomes that share common biological functions, chromosomal location, or regulation).
Biological Theme Comparison: After enriching and classifying the significantly expressed genes in several gene clusters, researchers may have further interest in identifying and comparing biological themes in gene clusters.

The R package is available on Bioconductor (https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html). And it implements methods to analyze and visualize functional profiles (GO and KEGG) of gene and gene clusters. The package can take input data from different sources (MicroArray, RNA-seq). After differential analysis, enrichment analysis (ORA: GO enrichment, KEGG enrichment, and FCS: GSEA enrichment) can be performed using the packages and displayed visually.

How to use clusterProfiler?

1. ID conversion

For common model species, ID conversion can be achieved by the OrgDb annotation package in Bioconductor. It shows successful experience in more than 20 different species till now, such as humans, mice, fruit flies, zebrafish, etc.. and for uncommon species or non-model species, AnnotationHub can be used instead.

Steps can be:

BIClusterProfiler-1

2. GO/KEGG/GSEA Enrichment analysis

2.1 Step of GO analysis: BIClusterProfiler-2.2.1 BIClusterProfiler-2.2.1-2

2.2 Steps of GSEA enrichment on GO Analysis: BIClusterProfiler-2.2.2

2.3 Steps of KEGG Analysis BIClusterProfiler-2.2.3

2.4 Steps of GSEA enrichment on KEGG Analysis BIClusterProfiler-2.2.4

Seeing this, do you know how to use clusterProfiler to analyse your data? Let's take out the data and begin to practice! If you have any questions or comments, just leave them here so that we could make progress together.

clusterProfiler Software rna-seq • 11k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 3.1 years ago by Novogene ▴ 420

0

Entering edit mode

thanks for sharing. It's really clear for me, a fresh man. Would it be possible if you can show some figures as reference for me? thank you!

ADD REPLY • link 3.1 years ago by louisewang222 • 0

0

Entering edit mode

and now there is a new paper for clusterProfiler, https://www.sciencedirect.com/science/article/pii/S2666675821000667.

ADD REPLY • link 2.7 years ago by Guangchuang Yu ★ 2.6k

score 1 · Answer 1 · 2021-03-04

For your convenience, I also provide reproducible code here.

1. ID conversion

Steps can be:

# install
# install.packages("BiocManager")
# BiocManager::install("org.Hs.eg.db")


# import library
suppressMessages(library(org.Hs.eg.db))
suppressMessages(library(clusterProfiler))

# Usage 
#bitr(geneID, fromType, toType, OrgDb, drop = TRUE)

> bitr(c("BRCA1","TP53"), fromType="SYMBOL", toType=c("ENTREZID","ENSEMBL"), OrgDb="org.Hs.eg.db");
'select()' returned 1:1 mapping between keys and columns
  SYMBOL ENTREZID         ENSEMBL
1  BRCA1      672 ENSG00000012048
2   TP53     7157 ENSG00000141510


> bitr(c("ENSG00000012048","ENSG00000141510"), fromType="ENSEMBL", toType=c("ENTREZID","SYMBOL"), OrgDb="org.Hs.eg.db");
'select()' returned 1:1 mapping between keys and columns
          ENSEMBL ENTREZID SYMBOL
1 ENSG00000012048      672  BRCA1
2 ENSG00000141510     7157   TP53


# Usage by kegg API
#bitr_kegg(geneID, fromType, toType, organism, drop = TRUE)

> bitr_kegg(c("7157","672"), fromType="kegg", toType="Path", organism="hsa");
   kegg     Path
1  7157 hsa01522
2   672 hsa01524
3  7157 hsa01524
4   672 hsa03440
    .
    .
    .
57 7157 hsa05418


> bitr_kegg(c("7157","672"), fromType="kegg", toType="ncbi-proteinid", organism="hsa");
  kegg ncbi-proteinid
1  672      NP_009225
2 7157      NP_000537

score 1 · Answer 2 · 2021-03-04

2. GO/KEGG/GSEA Enrichment analysis

2.1 Step of GO analysis:

1.Prepare the raw file (bp).
> head(term2gene_bp)
       go_id         gene_id
1 GO:0007166 ENSG00000000003
2 GO:0043123 ENSG00000000003
3 GO:0039532 ENSG00000000003
4 GO:1901223 ENSG00000000003
5 GO:0007165 ENSG00000000003
6 GO:0007154 ENSG00000000003
> head(term2name_bp)
       go_id
1 GO:0007166
2 GO:0043123
3 GO:0039532
4 GO:1901223
5 GO:0007165
6 GO:0007154
                                                                                          go_term
1                                                         cell surface receptor signaling pathway
2                                      positive regulation of I-kappaB kinase/NF-kappaB signaling
3 negative regulation of viral-induced cytoplasmic pattern recognition receptor signaling pathway
4                                                  negative regulation of NIK/NF-kappaB signaling
5                                                                             signal transduction
6                                                                              cell communication

2.. Differential Gene List.
> head(diffgene_vt)
[1] "ENSG00000271503" "ENSG00000072274" "ENSG00000134539" "ENSG00000139679"
[5] "ENSG00000130707" "ENSG00000196683"

score 0 · Answer 3 · 2021-03-04

3.. Enrich the genes.
BPenrich <- enricher(gene=diffgene_vt,TERM2GENE=term2gene_bp,TERM2NAME=term2name_bp,pAdjustMethod='BH',pvalueCutoff=1,qvalueCutoff=1)

> head(BPenrich)
                   ID
GO:0043620 GO:0043620
GO:0043618 GO:0043618
GO:0033993 GO:0033993
GO:0002181 GO:0002181
GO:0048511 GO:0048511
GO:0045444 GO:0045444
                                                                                 Description
GO:0043620                   regulation of DNA-templated transcription in response to stress
GO:0043618 regulation of transcription from RNA polymerase II promoter in response to stress
GO:0033993                                                                 response to lipid
GO:0002181                                                           cytoplasmic translation
GO:0048511                                                                  rhythmic process
GO:0045444                                                          fat cell differentiation
           GeneRatio   BgRatio       pvalue   p.adjust      qvalue
GO:0043620     6/153  47/16700 4.221260e-06 0.01049405 0.009073486
GO:0043618     5/153  43/16700 4.394608e-05 0.03557923 0.030762923
GO:0033993    15/153 493/16700 4.824066e-05 0.03557923 0.030762923
GO:0002181     7/153 107/16700 5.724736e-05 0.03557923 0.030762923
GO:0048511     9/153 197/16700 8.510415e-05 0.03579917 0.030953086
GO:0045444     9/153 198/16700 8.846914e-05 0.03579917 0.030953086
                                                                                                                                                                                                                                                                                                 geneID
GO:0043620                                                                                                                                                                            ENSBTAG00000010069/ENSBTAG00000008545/ENSBTAG00000004037/ENSBTAG00000033304/ENSBTAG00000051972/ENSBTAG00000022799
GO:0043618                                                                                                                                                                                               ENSBTAG00000010069/ENSBTAG00000008545/ENSBTAG00000004037/ENSBTAG00000051972/ENSBTAG00000022799
GO:0033993 ENSBTAG00000004037/ENSBTAG00000000507/ENSBTAG00000023179/ENSBTAG00000015026/ENSBTAG00000012046/ENSBTAG00000037811/ENSBTAG00000008573/ENSBTAG00000003650/ENSBTAG00000014429/ENSBTAG00000051972/ENSBTAG00000000163/ENSBTAG00000037778/ENSBTAG00000001024/ENSBTAG00000025434/ENSBTAG00000001864
GO:0002181                                                                                                                                                         ENSBTAG00000032433/ENSBTAG00000018320/ENSBTAG00000046623/ENSBTAG00000018628/ENSBTAG00000009757/ENSBTAG00000049723/ENSBTAG00000014423
GO:0048511                                                                                                                   ENSBTAG00000010069/ENSBTAG00000011437/ENSBTAG00000016169/ENSBTAG00000014127/ENSBTAG00000000706/ENSBTAG00000018093/ENSBTAG00000009012/ENSBTAG00000052562/ENSBTAG00000006122
GO:0045444                                                                                                                   ENSBTAG00000000507/ENSBTAG00000008573/ENSBTAG00000003650/ENSBTAG00000014127/ENSBTAG00000051972/ENSBTAG00000007614/ENSBTAG00000052562/ENSBTAG00000025434/ENSBTAG00000001864
           Count
GO:0043620     6
GO:0043618     5
GO:0033993    15
GO:0002181     7
GO:0048511     9
GO:0045444     9
>

4 . Draw the DAG Diagram.

BPenrich@ontology="BP"
pdf("GObp_DAG.pdf")
plotGOgraph(BPenrich,firstSigNodes = 5)
dev.off()

2.2 Steps of GSEA enrichment on GO Analysis:

#GSAE on Differential Genes
1.Prepare the foldchange value.
> head(foldchange)
ENSG00000271503 ENSG00000072274 ENSG00000134539 ENSG00000139679 ENSG00000130707
       3.261722       -2.881577        3.816291        2.985797        5.032315
ENSG00000196683
       2.217495

2.Enrich the genes.
BPenrich <- GSEA(geneList=foldchange,TERM2GENE=term2gene_bp,TERM2NAME=term2name_bp,pvalueCutoff=1,pAdjustMethod="BH")

3.Draw the DAG Diagram.

score 0 · Answer 4 · 2021-03-04

2.3 Steps of KEGG Analysis:

1.Prepare the relationship of genes and pathways, and take it them as background.
> head(term2gene)
  pathway_id         gene_id
1   hsa00010 ENSG00000117448
2   hsa00010 ENSG00000187758
3   hsa00010 ENSG00000196616
4   hsa00010 ENSG00000248144
5   hsa00010 ENSG00000198099
6   hsa00010 ENSG00000197894
> head(term2name)
  pathway_id                 pathway_name
1   hsa00010 Glycolysis / Gluconeogenesis
2   hsa00010 Glycolysis / Gluconeogenesis
3   hsa00010 Glycolysis / Gluconeogenesis
4   hsa00010 Glycolysis / Gluconeogenesis
5   hsa00010 Glycolysis / Gluconeogenesis
6   hsa00010 Glycolysis / Gluconeogenesis


2. Differential Gene List.
> head(diffgene_vt)
[1] "ENSG00000271503" "ENSG00000072274" "ENSG00000134539" "ENSG00000139679"
[5] "ENSG00000130707" "ENSG00000196683"


3.Enrich the genes as the same. 
KEGGenrich <- enricher(gene=diffgene_vt,TERM2GENE=term2gene,TERM2NAME=term2name,pAdjustMethod='BH',pvalueCutoff=1,qvalueCutoff=1)

4.Draw the scatter diagram.
pdf(file="keggDot.pdf")
dotplot(KEGGenrich, showCategory=30)
dev.off()

2.4 Steps of GSEA enrichment on KEGG Analysis BIClusterProfiler-2.2.4

#. GSEA Enrichment on differential genes.
1.Prepare the foldchang value.
> head(foldchange)
ENSG00000156427 ENSG00000274767 ENSG00000246100 ENSG00000255819 ENSG00000278996
       5.739748        5.735431        5.477079        5.365904        5.276536
ENSG00000156564
       5.230932
> KEGGenrich <- GSEA(geneList=foldchange,TERM2GENE=term2gene,TERM2NAME=term2name,pvalueCutoff=1,pAdjustMethod="BH")
Error in GSEA_internal(geneList = geneList, exponent = exponent, nPerm = nPerm,  :
  geneList should be a decreasing sorted vector...

>foldchange <- sort(foldchange,decreasing=TRUE)
> KEGGenrich <- GSEA(geneList=foldchange,TERM2GENE=term2gene,TERM2NAME=term2name,pvalueCutoff=1,pAdjustMethod="BH")
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...

2.Draw the GSEA scatter diagram.>

score 0 · Answer 5 · 2022-09-04

Novogene do you have done a three-time analysis separately using enricher to separate results on the basis of ontology. I mean in the first analysis your TERM2GENE parameter only contains gene_id with "BP" terms and not "CC" and "MF" in both DEG and universe(background). will it not affect biological interpretation?