Question

see reads in .bam file, no counts after HTseq

0

Entering edit mode

4.9 years ago

emilybowie2 • 0

Hi! VERY, VERY (so please be nice to me!) new to the field here, but briefly: performed RNAseq (mouse tissue), got the sequences (fastq format), cleaned them up (trimgalore), aligned them (RNASTAR to mm10) - checked alignments on IGV and can see sufficient reads across all exons for GOI, however when I use HTseq to make count files, GOI has no counts? I've been going through forum after forum trying to figure out where something went wrong, but can't figure it out - any suggestions/insights for investigation into this issue?

rna-seq counts htseq • 2.6k views

ADD COMMENT • link updated 4.8 years ago by h.mon 35k • written 4.9 years ago by emilybowie2 • 0

2

Entering edit mode

Please provide more detail:

copy / paste here the commands you used
genome and annotation versions were from same source and version?
could you provide a link to the GOI gene?

Read How To Ask Good Questions On Technical And Scientific Forums for guidance and reference for future questions.

ADD REPLY • link 4.9 years ago by h.mon 35k

1

Entering edit mode

Also is it just for 1 gene or is it for all genes you dont get any counts?

ADD REPLY • link 4.9 years ago by Kristoffer Vitting-Seerup ★ 4.0k

0

Entering edit mode

I appreciate the link for asking good questions - I'll elaborate the steps/commands:

Since I am a newbie, I uploaded all .fastq files to Galaxy - changed the file type to .fastqsanger before using Trim Galore. I used Trim Galore via Galaxy. My sequences were single-end, I chose automatic adaptor sequence trim, and left all other default things as default because...again, newbie. I ran FastQC on all samples, all samples looked good and had high quality scores.
- I also aligned the reads using RNA STAR in Galaxy. I used the built-in Mus musculus mm10 genome. I selected to NOT count number of reads per gene since I did this part later. I set length of the genomic sequence around annotated junctions to 100 (per someone across the hall suggestion), I left all other default settings as default which I'm seeing is a trend, and possibly I should look further into these settings to make sure I'm not missing something there.
- I downloaded these .bam/.bai files after alignment and checked on IGV viewer on the mm10 genome - this is where I can see my reads sufficiently for my GOI (Tautubulinkinase2: https://www.ncbi.nlm.nih.gov/gene/140810 - using this as GOI as it is the center of my experimental question, so I wanted to make sure it was in my dataset at this point. It is a lowly expressed gene throughout the animal, except in the brain/testes (though this expression is still low even in these tissues comparatively speaking), and I submitted brain RNA).
- Now for counts: initially I used FeatureCounts (again, on Galaxy...for someone like myself, this is a great tool!), selected the same Musmusculus mm10 genome from my history that I used for the annotation. Selected "stranded" for the input on strand specificity because that is what the core that gave me my samples did for library prep. Feature type: exon. Minimum bases of overlap: 1. Minimum mapping quality per read: 10. All other fancy options were default - seems the default is "no" for all additional things that featurecounts can do, again, I might need to revisit this. However the count output from featurecounts, did not have my GOI at this point, so I went to HTseq via that friend across the hall that mentioned it might be the better way to go about things. for HTseq, I followed the protcol I found on their website: https://htseq.readthedocs.io/en/release_0.11.1/count.html. I used my alignment files from RNA STAR, I downloaded this .gtf file for the genome: Mus_musculus.GRCm38.96.gtf.gz (this is essentially mm10 from ensembl), chose "union", and a minimum mapping quality per read of 10 (same as for featurecounts). Feature type: exon. Stranded: yes. These count files also do not have counts for my GOI. As kristoffer.vittingseerup asked below asked - I do have counts for 13000 other genes, just not my "control" gene in my control file (which this GOI has been manipulated in other samples that I want to compare to the control...so it's kind of important I can see counts for it in my control).

Wow that's a novel.

ADD REPLY • link 4.9 years ago by emilybowie2 • 0

2

Entering edit mode

Are you using the same transcript file for alignment and HTseq? Also, I would recommend you to use FeatureCounts for transcript quantification.

ADD REPLY • link 4.9 years ago by Arup Ghosh 3.2k

3

Entering edit mode

The core functionality of HTSeq and featureCounts is identical so I would have to argue there is no need to switch. If you want more accurate quantification you would need to use pseudo-alligners such as Kallisto or Salmon instead. Their main advantages are: 1) they can handle multimappers (featureCounts and HTSeq cannot) which usually are a large fraction of your data you otherwise ignore. 2) They have sequence bias correction algorithms.

ADD REPLY • link 4.9 years ago by Kristoffer Vitting-Seerup ★ 4.0k

0

Entering edit mode

I agree that Kallisto/Salmon can provide more accurate transcript quantification. But I was recommending featureCounts as it takes less time and resources to run.

ADD REPLY • link 4.9 years ago by Arup Ghosh 3.2k

score 0 · Answer 1 · 2019-06-23

I also aligned the reads using RNA STAR in Galaxy. I used the built-in Mus musculus mm10 genome. I selected to NOT count number of reads per gene since I did this part later.

No reason not to count with STAR, it is essentially a free analysis: it doesn't add significant run time, memory or disk usage, compared to just mapping. And the results are equivalent to featureCounts and HTSeq.

Selected "stranded" for the input on strand specificity because that is what the core that gave me my samples did for library prep.

The most common library preparation (Illumina TruSeq stranded) would result in a reverse-stranded library, so the correct setting for featureCounts on the command line is -s 2, and for HTSeq is -s reverse - I guess there should be a stranded: reverse option for Galaxy. You have to check the library preparation kit manual or ask the sequencing center staff for details.

I downloaded this .gtf file for the genome: Mus_musculus.GRCm38.96.gtf.gz (this is essentially mm10 from ensembl)

As arup hinted, you have to use genome and annotations from the same source, as the chromosome naming convention may be different between them (some sources use 1, 2, 3 and so on for chromosome names, while others use chr1, chr2, chr3). Check the bam header and the annotation gtf to see if chromosome names are equal.

Did you check the mapping rates (from STAR) and feature assignment rates (from featureCounts)? Do they seem correct? What proportion of your reads is mapped and what proportion is assigned to a feature? Did you check if the mappings you see at the gene of interest are multi-mappers? Multi-mapping reads are not counted by STAR / featureCounts / HTSeq. Then, as kristoffer.vittingseerup pointed out, you would need to use a different method (RSEM, Salmon or kallisto) to be able to quantify these reads.

Some additional comments:

Please define your abbreviations the first time you use them - I had no idea GOI meant gene of interest, I thought it was the name of a gene.

The issue of seeing mapped reads but having no counts is a recurrent one, and there are several posts on BioStars and other forums about it. If you search for the words on your post title, you will find several of them, do read them to check if a suitable solution has been posted elsewhere.

As you have been using Galaxy for most of your analyses, maybe you can ask your question there. Before doing so, follow up carefully on the suggestions here, and if none solves your question, then go ahead and post there. Please make sure to state here (edit your original question and add the information there) and at GalaxyHelp you cross posted, providing links between the posts.