Duplicate gene names in count matrix from GTF file that can't be parsed to edgeR
7.1 years ago
Chloe • 0

Hi all,

I'm having an issue with duplicate gene names/IDs that prevent edgeR from reading my count matrix.

I have a GFF3 file which I converted to GTF using:

gffread my.gff3 -T -o my.gtf

I am able to produce a count matrix from this GTF file with HTSeq. However, when I try to read the file for edgeR (I have tried both on the command line and through the Galaxy website), it cannot be read because of duplicate gene names.

This is the specific error I get in edgeR if that helps (I get the same error putting row.names=1):

>  x <- read.delim("counts.txt",row.names="Contig")
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
duplicate 'row.names' are not allowed

Looking at the GTF file, while there are multiple lines with the same gene name/ID/transcript ID etc., this is because they are different features of the same gene (the phase and position on the strand of each exon/part of the CDS)

Therefore it seems either I need to tell HTSeq to differentiate identical gene names based on position/feature or I need to convert GFF3->GTF in such a way that each gene name is unique/there is only one line that encompasses all the information it needs

Does anyone know which is the best way to do this (and how??) ?

(I have a feeling I will need to do this by changing some settings in HTSeq, so if anyone knows how to do this in Galaxy that would be amazing)

Many thanks,

Chloe

RNA-Seq genome DGE count matrix edgeR • 3.8k views

Hi Chloe,

There are several possible solutions, I think. The best is to find a GFF3 or GTF file with unique identifiers, such as Ensembl accessions.

Another (less optimal) option is to read your table into R without the row.names argument, and then assign the row names manually from your "Contig" column (although I expect problems if you use read.table on a table with mixed character and numeric columns).
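As a minimal sketch of that second option: the snippet below reads a toy tab-separated table without row.names and then shows two ways to get unique row names. The sample columns (sampleA, sampleB) and the gene IDs are made up for illustration; only the "Contig" column name comes from the question. Whether summing duplicated rows is appropriate depends on why the duplicates exist, so treat option B as an assumption to check against your own data.

```r
# Toy stand-in for counts.txt: tab-separated, with a repeated "Contig" ID.
# The sample columns and gene IDs are hypothetical.
counts_text <- paste(
  "Contig\tsampleA\tsampleB",
  "gene1\t10\t12",
  "gene2\t0\t3",
  "gene1\t5\t1",
  sep = "\n"
)

# Read WITHOUT row.names, so duplicated IDs do not trigger the error.
x <- read.delim(text = counts_text, stringsAsFactors = FALSE)

# Option A: keep every row, but make the IDs unique (gene1, gene1.1, ...).
x_unique <- x
rownames(x_unique) <- make.unique(x_unique$Contig)
x_unique$Contig <- NULL

# Option B: sum the counts of rows that share an ID.
x_summed <- rowsum(x[, -1], group = x$Contig)
```

With a real file, replace `text = counts_text` with the file name, e.g. `read.delim("counts.txt")`. Either resulting table has unique row names and can then be handed on as a count matrix.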

7.0 years ago
ytian • 0

I think you can list the duplicated genes with the code below (I assume the gene symbols are stored in a column of your matrix, here Data$Columns):

# Count how many times each gene symbol occurs
n_occur <- data.frame(table(Data$Columns))

# Show only the symbols that occur more than once
n_occur[n_occur$Freq > 1,]

Then you can check the locus information or coding length to verify the genes on that list.
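To make that check easier, here is a small self-contained sketch that also pulls out every row carrying a duplicated ID, so the rows can be compared side by side. The data frame, its "sampleA" column, and the gene IDs are hypothetical; swap in your own table and ID column.

```r
# Hypothetical count table; 'Contig' holds the gene IDs.
counts <- data.frame(
  Contig  = c("gene1", "gene2", "gene1", "gene3"),
  sampleA = c(10, 0, 5, 7),
  stringsAsFactors = FALSE
)

# Tabulate how often each ID occurs and keep those seen more than once.
n_occur <- as.data.frame(table(counts$Contig), stringsAsFactors = FALSE)
names(n_occur) <- c("Contig", "Freq")
dup_ids <- n_occur$Contig[n_occur$Freq > 1]

# Pull every row with a duplicated ID, grouped by ID for easy comparison.
dup_rows <- counts[counts$Contig %in% dup_ids, ]
dup_rows <- dup_rows[order(dup_rows$Contig), ]
```

If the duplicated rows turn out to be identical, dropping the repeats is probably safe; if their counts differ, it is worth finding out where in the pipeline the extra rows came from before deciding how to merge them.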
