Question

Making a counts

0

Entering edit mode

5.5 years ago

Nick • 0

Hi everyone, I'm a newbie in R and RNA seq analysis, so don't judge me so much) I'm trying to perform RNA-seq analysis as it says here http://combine-australia.github.io/RNAseq-R/06-rnaseq-day1.html , I was given a bunch of bam/bam.bai files(C33A and HeLa) 3 scrambled and 3 shoct4 files for each gene, so as a first step I need to do counts, here is the first question: Should I use featurecounts functions for all bam files of one type of genes or parse them separately?

Secondly, I tried to parse all files as separately and together and got counts txt files, which different http://combine-australia.github.io/RNAseq-R/06-rnaseq-day1.html enter image description here

enter image description here

Tutorial count file: My Count file:

RNA-Seq counts bam feauturecounts • 4.3k views

ADD COMMENT • link updated 5.5 years ago by h.mon 35k • written 5.5 years ago by Nick • 0

1

Entering edit mode

Use all files together when you feed them into featureCounts. That way you get a single matrix (genes in rows, samples in columns) that would be easier to use in downstream DE apps.

ADD REPLY • link 5.5 years ago by GenoMax 141k

1

Entering edit mode

Hi frozmanik,

welcome to Biostars. What you are asking are valid, but very basic questions that have been addressed a lot of times before. Therefore, I recommend that you spend quality time on reading this excellent tutorial/workflow on RNA-seq analysis, which also covers the questions you raised here. Please use google and the search function extensively because many of the common problems you'll encounter when reproducing the steps covered in the tutorial are extensively described here on Biostars as well as other communities or scientific blogs. Please also check How to add images to a Biostars post, and make sure the link you use is indeed correct. The link to the picture in your question seems to be broken. Cheers!

ADD REPLY • link 5.5 years ago by ATpoint 81k

0

Entering edit mode

Hello

use following commands for bam file, to create count matrix after aligning reads with genome

# in Linux
for f in *.sam; do samtools sort -@ 8 -o ${f/.sam/.bam} $f ;done
for f in *.bam; do htseq-count -f bam -i gene_name $f  genome_annotation.gtf  > ${f/.bam/.counts}; done 

#in R 
f <- list.files(".", "*.counts")
counts <- lapply(f, read.delim, header=F)
counts <- do.call(cbind, counts)
#remove 5 lines at the end
counts <- counts[-c(54135:54139),]
rownames(counts) <- counts[,1]
counts <- counts[,seq(ncol(counts)/2)*2]
colnames(counts) <- sub(".counts","",f)

ADD REPLY • link 5.5 years ago by ahmad mousavi ▴ 800

1

Entering edit mode

This might work for your specific situation, but because row numbers in counts <- counts[-c(54135:54139),] are hardcoded, it is not generic, and also rather complicated for something that can be solved with a one-liner like:

./featureCounts (options...) -a in.gtf -o countmatrix.txt *.bam

ADD REPLY • link 5.5 years ago by ATpoint 81k

0

Entering edit mode

While this answer may be correct on it own it is not addressing the question asked by OP. So I am moving this to a comment.

ADD REPLY • link 5.5 years ago by GenoMax 141k

score 1 · Answer 1 · 2018-11-05

As others said in the comments above, pass all bam files to featureCounts, then its output will contain all counts in one file, which is simpler to parse. However, there should be no differences in counts, regardless you pass one file at a time or all files together.

I don't understand which differences between your and the tutorial counts are puzzling you: 1) the actual counts; 2) the column names; 3) the missing length column.

The actual counts are different because you and the tutorial are analysing different data sets: cancer cell lines (C33A and HeLa) versus "basal stem-cell enriched cells (B) and committed luminal cells (L) in the mammary gland of virgin, pregnant and lactating mice". By the way, are you using a human or mouse reference genome?

Column names can be easily changed with colnames( seqdata ) <- c( "name1", "name2", "name3" ).

And the length column can be additionally parsed from the featureCounts output file.