Hi,
I have a list of bam files from RNA-seq; By FeatureCounts and annotation.gtf I obtain a weird matrix
featureCounts" "-t" "exon" "-g" "gene_id" "-a" "annotation.gtf" "-o" "counts.txt" bam1 bam2 ...
Geneid bam1 bam2 bam3
ENSG00000223972.4 0 0 0
ENSG00000227232.4 912 338 578
ENSG00000243485.2 0 0 0
ENSG00000237613.2 0 0 0
ENSG00000268020.2 0 0 0
ENSG00000240361.1 0 0 0
ENSG00000186092.4 0 0 0
ENSG00000238009.2 11 1 12
ENSG00000239945.1 0 0 0
ENSG00000233750.3 0 5 0
I need to collapse Transcript expression to gene-level expression providing the transcript-to-gene mapping from the Ensembl gene annotation.
tximport R package does not support any thing with matrix coming from FeatureCounts
Thank you
@ F
Annotations are coming from input gtf ("annotation.gtf", source: Ensembl) i.e 52871 gene IDs are coming from annotation gtf. To get ~20,000 genes, prune gtf to include only coding genes (genes that get translated).
try following code on annotation.gtf:
This should give you around 19922. Please change the code as per gene_biotype feature and number of fields in feature list of annotation.gtf. If correctly run, the code should give you number of protein coding genes (lines in gtf) in annotation.gtf. For human gtf from ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens, number is around ~ 20k. code:
Thanks a lot, actually I need to collapse Transcript expression to gene-level expression providing the transcript-to-gene mapping from the Ensembl gene annotation (protein and non protein coding)
Try following "annotation_genes_only.gtf" in feature coutns:
You should be getting 19907 genes.
Sorry FeatureCounts saying
Make sure that annotation gtf (original) file is tab separated and new gtf file is as per gt format.
Thank you, featurecounys running with your code for changing gtf,
My question is, can I use read counts of this for differential expression?
Because I guess here fastq files have been aligned on genome (GRCh37_g1k genome assembly) so I am just wondering if I can still use raw read counts of these bam files for differential expression
Sorry, FeatureCount finished but again I have 12437 ENSG00000186092.6 in my results
Does this mean they are all protein coding? Actually my ultimate goal of this post is colapsing transcrip level to gene level
there is some thing incorrect either with input gtf or output counts.