Question

(edgeR-related) DGElist error message 'Negative counts not allowed'.

1

Entering edit mode

5.1 years ago

fmerkal ▴ 60

Hello,

After counting genomic features with featureCounts

featureCounts -F GTF -t exon -g gene_id -f -O --minOverlap 2 -M --fraction -J -a annotation.gtf -o out.txt 1_sorted.bam 2_sorted.bam 3_sorted.bam 4_sorted.bam

I decided to use the output file as an input for edgeR to create the DGElist. First I used the read.delim function in R to read the text file and create a table. The command I used is shown below.

counts <- read.delim("~/Desktop/thesis_bioinformatics/chick_fl_hl_comparison/counts_fcounts.txt", sep="\t", header = TRUE, check.names = FALSE, stringsAsFactors = FALSE)

After getting the table, I tried using it to create the DGElist by using the commands below.

group<-c(1,1,2,2)
y <- DGEList(counts=counts, group=group)

This outputs an error message that says Error: Negative counts not allowed. I followed someones advice from this post Question: Error when running edgeR, where they suggest using is.na(counts) || counts < 0 to check if NA or negative counts are present. I checked all the columns of my counts file and the only column where the output was true was the Strand column, which describes the directionality. I removed this column from the table and tried running DGElist again with a counts file that did not have the Strands column. The same error message appeared. I'm pretty confused as to why I get this error message when none of my counts are negative. Any help would be appreciated.

Cheers, Fjodor

The start of my counts table looks something like this:

  Geneid              Chr  Start   End   Strand   Length   1   2   3   4

1 ENSGALG00000054818   1    9774  10061     -       288    #   #   #   #

RNA-Seq rna-seq software error R • 9.9k views

ADD COMMENT • link updated 5.1 years ago by Gordon Smyth ★ 7.0k • written 5.1 years ago by fmerkal ▴ 60

0

Entering edit mode

Did you manipulate out.txt prior to reading it with the read.delim command? With this command you should not be able to read a featureCounts output. It should complain about the header line (starting with # Program....) not matching the number of columns. Please show head -n 10 counts_fcounts.txt

ADD REPLY • link 5.1 years ago by ATpoint 81k

0

Entering edit mode

Yes, I did remove the header and gave the counts columns easier names.

Geneid  Chr Start   End Strand  Length  FL1 FL2 HL1 HL2

ENSGALG00000054818  1   9774    10061   -   288 0   0   0   0

ENSGALG00000054818  1   9314    9364    -   51  0   0   0   0

ENSGALG00000054818  1   8674    8904    -   231 2.10    1.67    1.33    2.20

ENSGALG00000054818  1   6755    6829    -   75  2.90    1.52    2.46    2.75

ADD REPLY • link 5.1 years ago by fmerkal ▴ 60

score 3 · Answer 1 · 2019-03-13

This is how I would do it:

library(edgeR)
fc <- read.delim("counts_fcounts.txt", stringsAsFactors = FALSE)
genes <- fc[,1:6]
counts <- data.matrix(fc[,7:11])
row.names(counts) <- paste(genes$Geneid, genes$Start, genes$End, sep=".")
group <- factor( c("FL","FL","HL","HL") )
y <- DGEList(counts, genes=genes, group=group)

score 2 · Answer 2 · 2019-03-13

2

Entering edit mode

5.1 years ago

h.mon 35k

After reading the output of featureCounts, you have to subset the data frame to include just the counts:

counts <- read.delim("~/Desktop/thesis_bioinformatics/chick_fl_hl_comparison/counts_fcounts.txt", sep="\t", header = TRUE, check.names = FALSE, stringsAsFactors = FALSE)
rownames( counts ) <- counts$Geneid
counts <- counts[ , -c(1:6) ]

ADD COMMENT • link 5.1 years ago by h.mon 35k

0

Entering edit mode

Hello, Thank you for your reply! This is a great idea and I tried it. The only issue is that there appear to be non-unique values in the Geneid column and this doesn't allow me to set the Geneid column values as row names. Could I still keep the R assigned row names and only delete columns 2:6?

ADD REPLY • link 5.1 years ago by fmerkal ▴ 60

0

Entering edit mode

Take everything I say bellow with a grain of salt, as I always performed summarization over genes, never over exons, and I don't know the recommended settings for exon-level summarization.

I suppose there are non-unique Geneids because you are summarizing over exons (flag -f) instead of over genes, and the exons don't have unique identifiers - I am really not sure, because I never used featureCounts with all these options (`-M -O --fraction -f -J), and it has been a long time since I last used featureCounts. Could you post a snippet of you GTF annotation?

Also, you may want to remove multi-mapped reads from your bam files, as those will be counted (flag -M) and featureCounts (with --fraction) use the naive method of just splitting the counts evenly between all features overlapped by the multi-mappers.

ADD REPLY • link 5.1 years ago by h.mon 35k

0

Entering edit mode

Hi, I was thinking about summarizing over genes instead of exons, after reading your comment, but I am interested in looking at alternative splicing as well as differential gene expression so I think that wouldn't be a good idea for my situation.

I looked more into this problem and found out that there is an option in DGElist called genes where you can specify the Geneids and other information related chromosome number and library size. Maybe this is a well-known thing but I am very new to RNA-seq analysis so I didn't know anything about it. Regardless, I tried the following with the counts table described earlier in mind and it seemed to work just fine.

library(edgeR)
counts <- read.delim("~/Desktop/thesis_bioinformatics/chick_fl_hl_comparison/counts_fcounts.txt", sep="\t", header = TRUE, check.names = FALSE, stringsAsFactors = FALSE)
View(counts)
genes <- counts[1:6]
group<-c(1,1,2,2)
y<-DGEList(counts = cbind(counts$FL1, counts$FL2, counts$HL1, counts$HL2), genes=genes, group=group)

Thank you for your comments, they definitely lead me to the right direction :)

ADD REPLY • link 5.1 years ago by fmerkal ▴ 60