Question

No aligned reads after featurecounts. BAM files have >85% reads mapped

0

Entering edit mode

4.9 years ago

LAB_2019 • 0

Hello, new to RNA-seq here. I have mapped my fastq reads to a reference genome (transcript.fasta) with HISAT2. The BAM files look very good with >85% reads mapped. However, when I try to use featurecounts, both local and on Galaxy, the results always yield 0 mapped reads. From lurking around the web, it seems there may be a disconnect between my BAM file Headers and the .GFF file that I am using in feature counts.

Example line of .GFF
Seqname Source  Feature Start   End Score   Strand  Frame   Group

Niben101Ctg00054    maker   exon    167 487 .   +   .   ID=Niben101Ctg00054g00001.1:exon:001;Parent=Niben101Ctg00054g00001.1;Alias=snap_masked-Niben101Ctg00054-processed-gene-0.0-mRNA-1:exon:4812

Example line of .BAM
A00247:109:GW190602181st_novaxp:1:1402:19651:8030   419 Niben101Ctg00109Ctg001  108 3   141M    =   120 151 CTTCAGACATCTTGGCATGGCATTTGACATGTATCAATCGTTGACACCATAAACAATACCAAGTTGGAGATGCATCAAGGAAAGGAACACCACAAGGTTCATCACAGTAGAAACAAAAGGCAGACATCTCAGGATTCTCAT   FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:141 YT:Z:UU NH:i:2 CC:Z:Niben101Scf04375Ctg112 CP:i:5129 HI:i:0

Here is a sample output summary:

Status  accepted_hits_Mock-1.bam
Assigned    0
Unassigned_Unmapped 0
Unassigned_MappingQuality   0
Unassigned_Chimera  7331650
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 10893853
Unassigned_Secondary    0
Unassigned_Nonjunction  0
Unassigned_NoFeatures   26615011
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    0

Is there any obvious discrepancy that could cause this? In featurecounts, I have used 'ID' as my gene identifier, not gene_id as it is not present in my .gff file. I'm hoping there is a simple error that is causing this complete failure.

Thanks!

RNA-Seq • 1.7k views

ADD COMMENT • link updated 4.9 years ago by lieven.sterck 15k • written 4.9 years ago by LAB_2019 • 0

0

Entering edit mode

How to add images to a Biostars post

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY • link 4.9 years ago by WouterDeCoster 47k

0

Entering edit mode

That is indeed likely the most obvious issue.

From the example you provided I think the reference sequences IDs do not seem to correspond (naming schema-wise)

Niben101Ctg00054 <--->  Niben101Ctg00109Ctg001

can you double check if they are OK? Did the formatting of the index for the alignment went ok?

ADD REPLY • link 4.9 years ago by lieven.sterck 15k

0

Entering edit mode

The BAM files were generated using a "contigs" .fasta and a "transcripts" .fasta. The transcript format wasx as such:

Niben101_annotation.transcripts.fasta
>Niben101Scf03546g00006.1
TTTTTAAATATTAAATTCATAAAATTTAGACCTGG

Is there a way to edit the BAM or .GFF files to create uniformity between these files? All files were downloaded from a single source ftp://ftp.solgenomics.net/genomes/Nicotiana_benthamiana/ It seems odd that they would not work together for this analysis.

ADD REPLY • link 4.8 years ago by LAB_2019 • 0

0

Entering edit mode

hard to tell without having our hands on your data.

why are you using a (genome?) contig and a transcript file as reference? If you merged 2 files into one you have to make sure the IDs are unique over both files.

can you check if the transcript/contig ID I posted before does exist in your contig/transcript file? eg by doing grep Niben101Ctg00109Ctg001 <contig/transcript file>

ADD REPLY • link 4.8 years ago by lieven.sterck 15k

0

Entering edit mode

Thanks for the advice. I am going to re-align one test sample using only 1 .fa file. I will double check the Genome.fa and transcripts.fa to make sure they have the same ID's as my .GFF file. I am mainly trying to differential gene expression analysis. Would you recommend using only the transcripts.fa? Or should I only use the contigs.fa?

ADD REPLY • link 4.8 years ago by LAB_2019 • 0

0

Entering edit mode

what is the difference between the contigs and the transcript file?

ADD REPLY • link 4.8 years ago by lieven.sterck 15k

0

Entering edit mode

I simplified this. I re-ran HISAT2 with my paired-end fastq files and only the transcripts.fa file as a reference. here is the format of the transcripts.fa

>Niben101Scf03546g00006.1
TTTTTAAATATTAAATTCATAAAATTTAGA

This produced the .BAM file. Here is a sample for idxStats (had ~65% mapped reads)

1   2   3   4
Niben101Scf03546g00006.1    2081    98  9
Niben101Scf03546g00007.1    2438    1331    18

I then tried to use featureCounts on galaxy with the BAM file shown below:

A00247:109:GW190602181st_novaxp:1:1501:15817:30107  83  Niben101Scf03546g00007.1    452 60  141M    =   404 -189    AAGAGGAAGTGGTTGTGGAGACATCATTTGCACCTAAGAGTTTTCCGAGCAAAGTGGGAGGGGGGATTAATGGAGAGCCACCAGATGATTCATCCCCTAATGCTTTGGAGAAATGGGTTGTAAAGGTTGAGCAGTCTATAA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF   AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:141 YS:i:0 YT:Z:CP NH:i:1
A00247:109:GW190602181st_novaxp:1:1235:3531:33974   163 Niben101Scf03546g00007.1    453 60  140M    =   632 320 AGAGGAAGTGGTTGTGGAGACATCATTTGCACCTAAGAGTTTTCCGAGCAAAGTGGGAGGGGGGATTAATGGAGAGCCACCAGATGATTCATCCCCTAATGCTTTGGAGAAATGGGTTGTAAAGGTTGAGCAGTCTATAA    FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFF:FFFFF,FFFFFFFFFFFFFFFF:FFFFFFF::FFFF:FFFFFFFFFFFFFFFFFFF:F,FFFFF:FFFFFFFFFFFFFFFFF    AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:140 YS:i:0 YT:Z:CP NH:i:1

I used the following .GFF file

Niben101Ctg01237    repeatmasker    match   1   526 1713    +   .   ID=Niben101Ctg01237:hit:145937:1.3.0.0;Name=species:Copia-3_SL-I|genus:LTR%2FCopia;Target=species:Copia-3_SL-I|genus:LTR%2FCopia 300 825 +
Niben101Ctg01237    repeatmasker    match_part  1   526 1713    +   .   ID=Niben101Ctg01237:hsp:274063:1.3.0.0;Parent=Niben101Ctg01237:hit:145937:1.3.0.0;Target=species:Copia-3_SL-I|genus:LTR%252FCopia 300 825 +

I am so confused by the nomenclature. there are no chromosomes, sometimes there are contigs (Ctg) , sometimes there are what I assume scaffolds (Scf).

ADD REPLY • link 4.8 years ago by LAB_2019 • 0

2

Entering edit mode

And did this work better?

You have to absolutely make sure you are using the GFF files that corresponds to the genomic/transcript file you used to do the mapping. Is the GFF file linked to your genomic file or transcript file.

In case of mapping to the genome you will indeed need to use the GFF file, in case of transcripts you don't need to use it, you can then just count the number of reads mappping to each transcript (eg with idxStats), no need to use GFF/FeatureCount then.

ADD REPLY • link 4.8 years ago by lieven.sterck 15k

0

Entering edit mode

Thanks for your help. I think I accomplished what I needed to do with your suggestions. Here is what I did:

Mapped reads w/ HISAT2 to transcripts.fa
Performed idxStats on .BAM files to count #mapped reads
Constructed raw counts file and performed differential gene expression analysis

The .GFF file that is associated with the genome.fa is unusable. Since I am only interested in known transcripts, I think this was the simplest way forward.

ADD REPLY • link 4.8 years ago by LAB_2019 • 0

1

Entering edit mode

There is a small problem with using idxstats when mapping to the transcriptome. Depending on your organism (which I'm not familiar with), most genes have multiple transcripts overlapping the same sequence. Thus a very large proportion of your reads are likely to be multi-mapping.

If you use idxstats, directly on the BAM, then any read that comes from a portion of a gene that is covered by more than one transcript will be double counted. You could filter out the multi-mappers first, using samtools view, reindex, and then use idxstats again. However, my feeling is that you would probably loose a very large proportion of your reads.

Luckily people have thought long and hard about this problem, and there are algorithmns out their for counting reads mapped to transcriptomes to get an accurate quantification: salmon and RSEM. Both can run from a transcriptome BAM file (although the default for salmon is to run from a fastq) and use EM to allocate multimapping reads proportionally to the likely source transcripts.

ADD REPLY • link 4.8 years ago by i.sudbery 19k

0

Entering edit mode

Thanks for pointing this out. I repeated my analysis using Salmon. Overall, the general trends seem highly similar, but this is definitely a more accurate approach.

ADD REPLY • link 4.8 years ago by LAB_2019 • 0