Question

SummarizedExperiment nrow differs from ncol

0

Entering edit mode

9.9 years ago

Parham ★ 1.6k

I am trying to optimize GAGE workflow (1) for my samples. I only have one control and one replicate. I changed parameter "each" from 4 to 1 and I got an error. How should I change it so that it fit?

> grp.idx <- rep(c("experiment", "control"), each=1)
> coldat=DataFrame(grp=factor(grp.idx))
> dds <- DESeqDataSetFromMatrix(cnts, colData=coldat, design = ~ grp)
Error in validObject(.Object) : 
  invalid class "SummarizedExperiment" object: 'colData' nrow differs from 'assays' ncol

Cheers!

DESeq2 SummarizedExperiment • 9.7k views

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 9.9 years ago by Parham ★ 1.6k

2

Entering edit mode

What's the output of ncol(cnts)?

ADD REPLY • link 9.9 years ago by Devon Ryan 104k

1

Entering edit mode

> ncol(cnts)
[1] 8

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Parham ★ 1.6k

1

Entering edit mode

Right, so 8 samples with only 2 having a group assignment. The number of rows of coldat needs to equal the number of samples (i.e., the number of columns in cnts). Otherwise, you end up with samples with no group assignment and DESeq2 has no clue what you really want to do (thus, the error).

ADD REPLY • link 9.9 years ago by Devon Ryan 104k

0

Entering edit mode

I see! Then I think I am doing it totally wrong may be! I thought this workflow is something general with basic commands that can be applied to my data so I can get start with! But it turned out more complicated than it seemed! I managed to map my genes with Trapnel et al 2012 protocol for tophat and cufflinks and was quiet successful and I was looking for something like that to learn with, but this one is very confusing to me.

I haven't changed anything in the workflow and tried it as it was, and surprisingly there are 8 samples?! Shouldn't it be two based on one "experiment" and one "control"? How do I change rows of coldat?!

Thanks Devon, you helped a lot so far.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Parham ★ 1.6k

1

Entering edit mode

It is a general workflow, but nothing can be so general that you give incorrect sample-group assignments (or wrong file names, or other things of that sort).

I would have to know how you made cnts to know why it has more columns than the two you expected. There should be one column for each sample, so either you actually do have more samples or you did something wrong when creating that matrix (that'd be my guess). You're making coldat from grp.idx, so just change how you make that. Having said that, don't just randomly change things until the error goes away. The values need to accurately describe the samples, or else you end up with some control samples being in the experiment group (and vice versa) and then the results are meaningless.

You might want to get a book on using R, since knowing the basic concepts will make your life vastly easier (most example workflows assume you know at least the basics).

ADD REPLY • link 9.9 years ago by Devon Ryan 104k

0

Entering edit mode

All right, I see your point. Now I got a book on R and I am trying to get a better understanding of it. Since I am a biology grad student and with very little knowledge on statistics and bioinfs, its very useful to get tips from people like you to show the "how to" path. However, now I think I get a better output from the scripts, at least until some points. I am including all the scripts I wrote for my experiment. I have two conditions and one replicate each! Named them pfh1 and pfh1-de!

> library(GenomicFeatures)
> library(biomaRt)
> txdb<-makeTranscriptDbFromBiomart(biomart="fungi_mart_22", dataset="spombe_eg_gene", host="fungi.ensembl.org")
> exByGn <- exonsBy(txdb, "gene")

> library(Rsamtools)
> fls <- list.files("tophat_all_2/", pattern="bam$", full.names=T)
> bamfls <- BamFileList(fls)
> flag <- scanBamFlag(isNotPrimaryRead=FALSE, isProperPair=TRUE)
> param <- ScanBamParam(flag=flag)
> library(GenomicAlignments)
> gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", ignore.strand=TRUE, single.end=FALSE, param=param)
> gnCnt

class: SummarizedExperiment
dim: 7017 2
exptData(0):
assays(1): counts
rownames(7017): SPAC1002.01 SPAC1002.02 ... SPSNRNA.06 SPSNRNA.07
rowData metadata column names(0):
colnames(2): pfh1.bam pfh1-de.bam
colData names(0)

> pfh1.cnts=assay(gnCnt)
> require(gageData)

from gnCnt it seems that it sorts my data correctly, having 7017 gene (expected) in 2 columns (conditions). Am I right?

From this step I don't understand the concept of loading hnrnp data from gageData! Is that something I have to skip it for my analysis? That was an amateur mistake! Because the output of dim(pfh1.cnts) is [1] 7017 2!

I generally don't understand these following scripts! If you can kindly explain them please:

hnrnp.cnts=assay(gnCnt)

require(gageData)
data(hnrnp.cnts)
cnts=hnrnp.cnts

Thanks for reading all this!

Cheers!

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Parham ★ 1.6k

1

Entering edit mode

Yeah, it looks like gnCnt has the correct form now. The last three lines just load an example dataset so that you can actually run things in R without having your own data. There's no reason for you to bother with those.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks, what is happening here?

> sel.rn=rowSums(cnts) != 0
> cnts=cnts[sel.rn,]

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.8 years ago by Parham ★ 1.6k

1

Entering edit mode

The first line can be broken into two parts: rowSums(cnts) != 0, which compares the sum of each row to zero and returns True if the sum > 0; and then sel.rn =, which just saves the vector of True/False values. The purpose of that is to just see which rows aren't all zeros. Then cnts=cnts[sel.rn,] subsets cnts to contain only those rows.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.8 years ago by Devon Ryan 104k

0

Entering edit mode

I see, but then it seems I get another problem, when I run it on my data! I lose all genes! All of them have zero values?!

> cnts=assay(gnCnt)
> dim(cnts)
[1] 7017    2
> sel.rn=rowSums(cnts) !=0
> cnts=cnts[sel.rn,]
> dim(cnts)
[1] 0 2

How can I avoid this?

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.8 years ago by Parham ★ 1.6k

0

Entering edit mode

Check that the chromosome names of the BAM files match those of txdb. If they do, look at things in IGV or another genome browser.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.8 years ago by Devon Ryan 104k

0

Entering edit mode

How do I check the chromosome names?

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.8 years ago by Parham ★ 1.6k

1

Entering edit mode

From the command line: samtools view -H foo.bam (change foo.bam to the name of a BAM file). For that genome, the chromosome should be I, II, III, MT and AB325691.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.8 years ago by Devon Ryan 104k

0

Entering edit mode

Yes it looks it has the correct headers! What should I seek for in IGV?

@HD    VN:1.0    SO:coordinate
@SQ    SN:AB325691    LN:20000
@SQ    SN:I    LN:5579133
@SQ    SN:II    LN:4539804
@SQ    SN:III    LN:2452883
@SQ    SN:MT    LN:19431
@SQ    SN:MTR    LN:20128
@PG    ID:TopHat    VN:2.0.9    CL:/usr/bin/tophat -p 8 -o tophat_out_tag4 /home/parham/pombe/annotations/Schizosaccharomyces_pombe_Ensembl_EF2/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/Bowtie2Index/genome /home/parham/pombe/uppmax_data/131128_SN866_0274_BC2N9CACXX/Sample_S0001_tag4/S0001_tag4_TGACCA_L004_R1_001.fastq

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.8 years ago by Parham ★ 1.6k

0

Entering edit mode

I also checked IGV. However I didn't know what you exactly mean but I randomly checked some genes and their mapped reads, which looked fine! Is there any thing specific I have to look on?

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.8 years ago by Parham ★ 1.6k

0

Entering edit mode

There's nothing specific. The goal it just to eye-ball a couple genes and see if they actually have reads mapping to them. If they do and the counts you're getting with summarizeOverlaps() to even remotely match, then something is going wrong at that step.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.8 years ago by Devon Ryan 104k

0

Entering edit mode

Do you have any idea what is going wrong?!

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.8 years ago by Parham ★ 1.6k

0

Entering edit mode

Not without seeing the data. You can always post the BAM file somewhere and I or someone else can just have a look.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.8 years ago by Devon Ryan 104k

0

Entering edit mode

All right! I can do that but I don't know where do people upload! If you could provide some links please!

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.8 years ago by Parham ★ 1.6k

0

Entering edit mode

Dropbox, google drive, just google for "large file send"

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.8 years ago by Devon Ryan 104k

0

Entering edit mode

Hi Devon,

I have been having troubles with getting gage working and was following this thread to sort the issue. My chromosomes have the wrong headers. Is there an easy way to change these?

cheers in advance

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.6 years ago by justin.mitchell • 0

0

Entering edit mode

Hi Justin,

Please post things like this as new questions, they're easier to track and will receive quicker responses that way.

SummarizedExperiment objects inherit the seqlevels() accessor. So you can just change all of the names in one relatively simple command, like seqlevels(se) <- sprintf("chr%s", as.character(seqlevels(se))). Something along those lines should work for you.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.6 years ago by Devon Ryan 104k

1

Entering edit mode

You have single-end data, so you need this

flag <- scanBamFlag(isNotPrimaryRead=FALSE)

instead of this

flag <- scanBamFlag(isNotPrimaryRead=FALSE, isProperPair=TRUE)

Also, this

gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", ignore.strand=TRUE, single.end=FALSE, param=param)

should instead be this

gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", ignore.strand=TRUE, param=param)

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.8 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks a million Devon! It is fine now!

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.8 years ago by Parham ★ 1.6k

0

Entering edit mode

Devon may I ask how you looked up for single-end/pair-end in the BAM file please?

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.5 years ago by Parham ★ 1.6k

0

Entering edit mode

I don't recall exactly what I did 4 months ago. If I had the BAM file, then I just looked at the flag fields. If not, I see that the header in one of your previous comments has the tophat command used, and that's only using single-end reads, so perhaps I went off of that (though I vaguely remember just looking at the dataset).

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.5 years ago by Devon Ryan 104k

0

Entering edit mode

Yes, mine was a stupid question! I shared the BAM file here and you just checked something and said that I have single-end data! So do I have to convert BAM to SAM and check for flags?! If so is there any place explaining the flags?

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.5 years ago by Parham ★ 1.6k

1

Entering edit mode

Check here or the SAM specification if you really want more details.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.5 years ago by Devon Ryan 104k

1

Entering edit mode

BTW, rep(c("experiment", "control"), each=1) is the same as c("experiment", "control").

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Devon Ryan 104k