Hello Biostars members,
I am analysing my RNA-seq data (20 M PE, 150 bp) using STAR (2.5V) to map the trimmed reads and "bedtools multicov" tool to obtain the transcript counts. After generating the genome index (the genome and GFF are available), I used the following STAR script to map my libraries:
STAR --runMode alignReads --runThreadN 20 \
--genomeDir /Indexs/ --readFilesIn F.paired.1.fastq R.paired.2.fastq \
--bamRemoveDuplicatesType UniqueIdentical --outFilterMultimapNmax 1 \
--outBAMsortingThreadN 6 --quantMode GeneCounts \
--outSAMtype BAM SortedByCoordinate
The mapping seems to be ok according to the Log.final.out files that shows 90-88% of mapped reads (18M ) and average mapped length was 266 bp in the different libraries. After that, I used samtools index
to create the index of the STAR output files Aligned.sortedByCoord.out.bam
. Finally, I used:
bedtools multicov -bams -bed filtered.gff3 > TranscriptCounts.txt
Including after -bams
the different paths to the different .out.bam
files
When I check the TranscriptCounts file, the total number of counts in each library is close to 36 M, which is almost twice the number of mapped reads. I have check the sorted and indexed bam files as well as the GFF file and they seem to be ok. Even, I have run STAR to generate unsorted BAM files as output and after that I made the sort and index steps and the result was the same.
Thanks a lot for your help
All the best
Jose
Don't use
multicov
. It has its applications but RNA-seq is none of it. Use dedicated tools such asHTseq
,featureCounts
or a leightweight quantifier such assalmon
orkallisto
to obtain a count matrix from your alignment. The main point is probably what i.sudbery said below with respect to the gff.Thanks for your quick answers and comments !
I will try HTseq and Featurecounts to obtain the count matrix instead of bedtools multicov
Regarding Ian Sudbery's comment about gff3 filter, I used only gene lines, avoiding the problem with the isoforms. So, it could be a problem with bedtools multicov sensitivity on gene count process.
All the best
Thanks for your help
Jose
Please do not add an answer unless it is an actual answer to the top level question. Also, please use the formatting bar (especially the
code
option) to present your post better. You can use backticks for inline code (`text` becomestext
), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.If you want to be free of all that trouble, simply use a quantifer like
salmon
orkallisto
to directly quantify your fastq files against a reference transcriptome in fasta format. Isoforms do exist and simply removing them is probably oversimplifying the reality. Most protein-coding genes have > 1 isoform which influence the overall gene expression level.salmon
will take care of that all, give it a try.