Question

how to extract counts from bam/merge-bam without gtf

0

Entering edit mode

5.4 years ago

Shahzad ▴ 30

I have aligned my RNA-seq PE on Transcriptome with no gtf file available using Bowtie2.

Used Samtools to make bam files and sorted.

Used sambamba merge to merge the sorted bam replicates and resorted using samtools.

bowtie2 -x ../refrence-transcriptome -1 rep1_1.fq -2 rep1_2.fq -S sample1.sam --no-unal 
samtools view -Sb 1.sam > 1.bam
samtools sort 1.bam -o 1_sorted.bam
samtools index 1rc_sorted.bam
sambamba merge mergesamples.bam 1.bam 2.bam 3.bam

I want to know what tools i can use to get counts with no gtf/gff/gff3 available and only have ref.fa?

RNA-Seq R bowtie2 samtools Hisat2 • 3.4k views

ADD COMMENT • link updated 5.4 years ago by michael.ante ★ 3.8k • written 5.4 years ago by Shahzad ▴ 30

score 1 · Answer 1 · 2018-11-08

A couple of things:

You should align your RNA-seq data to the genome rather than the transcriptome, but this only makes sense of you have a GTF and splice sites. Given you had these, also use a splice-aware aligner such as hisat2 or star.
Merging replicates (given these are biological replicates) is not recommended as you lose any information about biological variability between the samples.

EDIT: See also michael.ante's answer below on why alignment is an inferior choice here.

I you situation, given the absence of a GTF, I would use salmon to quantify your reads directly against your transcriptome, which you seem to have in fasta format. Salmon is a alignment-free (pseudoaligner) tool with extensive documentation. Spend some quality time reading it and come back in case of questions. Do the quantification for each fastq files separately, given these are biological replicates.

score 1 · Answer 2 · 2018-11-08

1

Entering edit mode

5.4 years ago

michael.ante ★ 3.8k

Hi Shahzad,

If you have only the transcriptome annotation, I'd go for Salmon .

In case your species has multiple transcript per gene, you might run into troubles with the multimapping reads. Per default bowtie2 reports one alignment per read(-pair). Even if you have several with the same score. Thus, if several transcripts of a gene share an exon, all reads aligning on this exon will be assigned "randomly" to just one transcript.

If your species has only one transcript per gene, you can try to use samtools idxstats on your bam files to get the alignments per transcript.

Cheers,

Michael

ADD COMMENT • link 5.4 years ago by michael.ante ★ 3.8k

0

Entering edit mode

how to proceed further with idxstat output. I have done it but I thing I have done any mistake DESeq did not accept it. Can you please guide about how to format the output tables for DESeq or EdgeR?

ADD REPLY • link 5.4 years ago by Shahzad ▴ 30

1

Entering edit mode

With samtools idxstat, you get a table of the contig name, the contig's length, the number of alignments, and the number of unmapped reads. Having such a table in R. you only need to select the first column as row names and the third column as the counts. You also might discard the last row.

ADD REPLY • link 5.4 years ago by michael.ante ★ 3.8k