collapsing transcript expression to gene-level expression
1
0
Entering edit mode
5.1 years ago
zizigolu ★ 4.3k

Hi,

I have a list of bam files from RNA-seq; By FeatureCounts and annotation.gtf I obtain a weird matrix

featureCounts" "-t" "exon" "-g" "gene_id" "-a" "annotation.gtf" "-o" "counts.txt" bam1 bam2 ...


Geneid             bam1   bam2  bam3
ENSG00000223972.4   0    0  0
ENSG00000227232.4   912 338 578
ENSG00000243485.2   0   0   0
ENSG00000237613.2   0   0   0
ENSG00000268020.2   0   0   0
ENSG00000240361.1   0   0   0
ENSG00000186092.4   0   0   0
ENSG00000238009.2   11  1   12
ENSG00000239945.1   0   0   0
ENSG00000233750.3   0   5   0

I need to collapse Transcript expression to gene-level expression providing the transcript-to-gene mapping from the Ensembl gene annotation.

tximport R package does not support any thing with matrix coming from FeatureCounts

Thank you

FeatureCounts BAM RNA-Seq • 4.1k views
ADD COMMENT
1
Entering edit mode

@ F

Annotations are coming from input gtf ("annotation.gtf", source: Ensembl) i.e 52871 gene IDs are coming from annotation gtf. To get ~20,000 genes, prune gtf to include only coding genes (genes that get translated).

try following code on annotation.gtf:

$ awk -F "\t" '$3 == "gene" { print $9 }' annotation.gtf |awk -F "; " '$5 == "gene_biotype \"protein_coding\";"' | wc -l

This should give you around 19922. Please change the code as per gene_biotype feature and number of fields in feature list of annotation.gtf. If correctly run, the code should give you number of protein coding genes (lines in gtf) in annotation.gtf. For human gtf from ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens, number is around ~ 20k. code:

$ awk -F "\t|; " '{if ($3 =="gene" && $13=="gene_biotype \"protein_coding\";") print  }' Homo_sapiens.GRCh38.95.chr.gtf  | wc -l 

19922
ADD REPLY
0
Entering edit mode

Thanks a lot, actually I need to collapse Transcript expression to gene-level expression providing the transcript-to-gene mapping from the Ensembl gene annotation (protein and non protein coding)

ADD REPLY
1
Entering edit mode

Try following "annotation_genes_only.gtf" in feature coutns:

$ awk -v OFS="\t" -F "\t|; " '/^#/ {print};{if ($3 =="exon" && $11=="gene_type \"protein_coding\"") print }'  annotation.gtf  > annotation_genes_only.gtf

You should be getting 19907 genes.

ADD REPLY
0
Entering edit mode

Sorry FeatureCounts saying

Load annotation file annotation_genes_only.gtf ...                         ||
ERROR: no features were loaded in format GTF. The annotation format can be specified by the '-F' option, and the required feature type can be specified by the '-t' option..
The porgram has to terminate.

[fi1d18@cyan01 bin]$
ADD REPLY
1
Entering edit mode

Make sure that annotation gtf (original) file is tab separated and new gtf file is as per gt format.

ADD REPLY
0
Entering edit mode

Thank you, featurecounys running with your code for changing gtf,

My question is, can I use read counts of this for differential expression?

Because I guess here fastq files have been aligned on genome (GRCh37_g1k genome assembly) so I am just wondering if I can still use raw read counts of these bam files for differential expression

ADD REPLY
0
Entering edit mode

Sorry, FeatureCount finished but again I have 12437 ENSG00000186092.6 in my results

Does this mean they are all protein coding? Actually my ultimate goal of this post is colapsing transcrip level to gene level

ADD REPLY
0
Entering edit mode

there is some thing incorrect either with input gtf or output counts.

$ grep  ENSG00000186092.6 gencode.v29.annotation.gtf | wc -l
19
ADD REPLY
4
Entering edit mode
5.1 years ago
h.mon 35k

edit:

I need to collapse Transcript expression to gene-level expression

I will repeat: your counts already are at gene level. You don't need tximport, because you don't have transcript-level counts, you have gene-level counts.

By the way, why the count matrix is "weird"?

end of edit.

This is already at gene level, the Geneid column are Ensembl gene identifiers:

Gene: WASH7P (ENSG00000227232)

Gene: DDX11L1 (ENSG00000223972)

And so on...

ADD COMMENT
0
Entering edit mode

Sorry but I assumed 20000 genes while here I am seeing 52871 genes in rows :(

ADD REPLY
1
Entering edit mode

Why did you expect 2000 genes if the number of annotated genes in the human genome is much higher (see number of genes in human genome )? Is this a custom gtf?

ADD REPLY
0
Entering edit mode

Sorry I meant 20000, may be these 20000 only are protein coding

How could I have this matrix by gene symbol instead?

ADD REPLY
1
Entering edit mode

Show us a few lines of annotation file. Especially the fields to the right.

ADD REPLY
0
Entering edit mode

Thank you

This is gtf I used for featurecounts

##description: evidence-based annotation of the human genome (GRCh38), version 29 (Ensembl 94)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2018-08-30
chr1    HAVANA  gene    11869   14409   .   +   .   gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
ADD REPLY
2
Entering edit mode

Instead of -g gene_id you could have simply used -g gene_name to get the gene names in your count file.

ADD REPLY
0
Entering edit mode

Thanks a lot for attempting to clear my confusion

I am afraid that my bam files coming from the alignment of original fastq files on genome (GRCh37_g1k genome assembly) , That makes my worried a lot

I am just wondering if my bam files are really coming from the alignment of original fastq files on genome instead of transcriptom, can I still use raw read counts extract by featurecounts for differential expression analysis?

I got totally confused because in this paper

https://www.nature.com/articles/s41467-018-05190-9

They are saying

Gene expression was quantified using Salmon in quasi-mapping mode and with parameters −l = ISR and −p = 10. The Ensembl gene annotation GRCh37.87 was provided. Transcript expression was collapsed to gene-level expression in R using tximport and providing the transcript-to-gene mapping from the Ensembl gene annotation.

Why they are doing that?

ADD REPLY
1
Entering edit mode

How did you align the reads? Did you use a reference genome or transcriptome? I believe GRCh37_g1k is a reference genome, and you used featureCounts with a genome annotation gtf - but it is just a guess, how could I know for sure?

The paper you linked used a different method: Using a reference transcriptome, Salmon counts over transcripts, then tximport summarizes the counts over genes.

Why would you be confused by a different method used by a third party?

ADD REPLY
0
Entering edit mode

Sorry, because the authors of the mentioned paper are our collaborators. I have been given a set of bam files. I am not sure they are resulted from aligning fastq files on genome or transcriptom. For example the names of one of my bam files is

HUMAN_1000Genomes_hs37d5_RNA_seq_WTSI-COLO_019_1pre.dupmarked.bam

(That is why I suspect of a reference genome)

One of my goals is finding differentially expressed genes

My confusion is; Basically could I find differentially expressed genes from these bam files ignoring the reference genome (GRCh37.87 or GRCh37_g1k) ?

ADD REPLY
1
Entering edit mode

As long as the reference the data is aligned to is identical for all samples, you should be able to do your DE analysis normally.

I am not sure they are resulted from aligning fastq files on genome or transcriptom

Did you look in the header of your BAM's to see if the alignment program used captured the command line there? Can give you a clue as to what (type of) reference was used.

BTW: If they are collaborators this type of information should just be an email/call away. Have you asked them?

ADD REPLY
1
Entering edit mode

These issues can be easily solved by contacting the said collaborators.

With some detective work, I know you mapped to a reference genome: you showed some lines of the gtf annotation, which have chr1 as sequence name, so it means the gtf is from a genome annotation, and featureCounts could assign counts to genes, which it would not do if the gtf and bam reference sequence names didn't match. However, the same detective work indicates you did something wrong: the gtf and genome version doesn't match. You said the bam was aligned to GRCh37_g1k, but the gtf says it is for GRCh38:

description: evidence-based annotation of the human genome (GRCh38), version 29 (Ensembl 94)

You can also check with samtools:

samtools view -h HUMAN_1000Genomes_hs37d5_RNA_seq_WTSI-COLO_019_1pre.dupmarked.bam | head

The output for a genome-mapped bam will most likely contain chromosome names:

@HD     VN:1.4  SO:coordinate
@SQ     SN:chr1 LN:197608386
@SQ     SN:CHR1_134_RANDOM      LN:1765
@SQ     SN:CHR1_360_RANDOM      LN:2384

The output for a transcriptome-mapped bam will most likely contain transcript names (in the example bellow, NCBI nuccore transcript identifiers):

@SQ     SN:XM_025152649.1       LN:431
@SQ     SN:XM_025152753.1       LN:4244
@SQ     SN:XM_025152751.1       LN:4283
@SQ     SN:XM_025152757.1       LN:4214

After checking if the bam corresponds to a genome or transcriptome reference, then you can follow the appropriate downstream analysis workflow.

ADD REPLY
0
Entering edit mode

Thank you

This is the head of one of my bam files

[fi1d18@cyan01 ~]$ module load samtools/1.3.2
[fi1d18@cyan01 ~]$ samtools view -h /temp/hgig/fi1d18/1672_WTSI-COLO_019_1pre/mapped_sample/HUMAN_1000Genomes_hs37d5_RNA_seq_WTSI-COLO_019_1pre.dupmarked.bam | head
@HD     VN:1.4  SO:coordinate
@SQ     SN:1    LN:249250621
@SQ     SN:2    LN:243199373
@SQ     SN:3    LN:198022430
@SQ     SN:4    LN:191154276
@SQ     SN:5    LN:180915260
@SQ     SN:6    LN:171115067
@SQ     SN:7    LN:159138663
@SQ     SN:8    LN:146364022
@SQ     SN:9    LN:141213431
[fi1d18@cyan01 ~]$

If I am not wrong this is a genome aligned work, so I have extracted right raw read counts from FeatureCounts for differential analysis?

ADD REPLY
1
Entering edit mode

That is correct. Did you look farther enough down in the headers to see if the command line has been captured? That should be the last line before the alignments.

ADD REPLY
0
Entering edit mode

Sorry I typed this but not showing anything about command used

samtools view /temp/hgig/fi1d18/1672_WTSI-COLO_019_1pre/mapped_sample/HUMAN_1000Genomes_hs37d5_RNA_seq_WTSI-COLO_019_1pre.dupmarked.bam  | head -10
ADD REPLY
1
Entering edit mode

You will need to view the entire header.

samtools view -H your.bam

Please don't start with "sorry" when you post anything new.

ADD REPLY
0
Entering edit mode

Thank you

ID:STAR PN:STAR VN:STAR_2.5.0c  CL:/software/CGP/canpipe/live-1.27.0/bin/STAR   --runMode alignReads   --runThreadN 4   --genomeDir /lustre/scratch112/sanger/cgppipe/canpipe/live/ref/Human/GRCh37d5/star   --genomeLoad NoSharedMemory   --genomeSAindexNbases 14   --genomeChrBinNbits 18   --genomeSAsparseD 1   --readFilesIn /lustre/scratch112/sanger/cgppipe/canpipe/live/data/lane/1672_344790/tmpStar/input/WTSI-COLO_019_1pre.1_1.fastq.gz   /lustre/scratch112/sanger/cgppipe/canpipe/live/data/lane/1672_344790/tmpStar/input/WTSI-COLO_019_1pre.1_2.fastq.gz      --readFilesCommand zcat      --readMatesLengthsIn NotEqual   --readMapNumber 18446744073709551615   --bamRemoveDuplicatesMate2basesN 0   --limitGenomeGenerateRAM 31000000000   --limitIObufferSize 150000000   --limitOutSAMoneReadBytes 100000   --limitOutSJcollapsed 1000000   --limitOutSJoneRead 1000   --limitBAMsortRAM 64606632121   --limitSjdbInsertNsj 1000000   --outFileNamePrefix /lustre/scratch112/sanger/cgppipe/canpipe/live/data/lane/1672_344790/tmpStar/star/   --outStd Log   --outReadsUnmapped None   --outQSconversionAdd 0   --outSAMtype BAM   Unsorted      --outSAMmode Full   --outSAMstrandField intronMotif   --outSAMattri
ADD REPLY
0
Entering edit mode

Sorry, I just noticed in these bam files I have paired fastq files, I was already converting bam to fastq by bedtools to use salmon for transcript counting. by

bamtofastq -i /temp/hgig/fi1d18/1672_WTSI-COLO_027_1pre/mapped_sample/HUMAN_1000Genomes_hs37d5_RNA_seq_WTSI-COLO_027_1pre.dupmarked.bam -fq /temp/hgig/fi1d18/1672_WTSI-COLO_027_1pre/mapped_sample/HUMAN_1000Genomes_hs37d5_RNA_seq_WTSI-COLO_027_1pre.dupmarked.fq

I am getting only fastq, so is it ok? Or I need to retrieve both fastq files?

ADD REPLY
0
Entering edit mode

Have you tried providing:

-fq2    FASTQ for second end. Used if BAM contains paired-end data. BAM should be sorted by query name (samtools sort -n aln.bam aln.qsort) if creating paired FASTQ with this option.

in addition to your option above?

ADD REPLY

Login before adding your answer.

Traffic: 2918 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6