Question

What GTF file should I use in htseq?

0

Entering edit mode

6.3 years ago

statfa ▴ 760

Hi biostars

I used Plyester package to simulate some RNA seq data. Polyester provides simulated RNA reads FASTA files from a reference FASTA as output. For example, lets assume that I want to simulate 100 transcript. So I use the reference FASTA file and choose its first 100 transcripts using

> fatsa[1:100]         #fasta is the reference FASTA file imported in R

Then I use HISAT to align my simulated RNA read FASTA files to reference genome (for example hg38) and htseq to obtain the read counts. But the problem is the read counts I'm given are a table of all transcripts in the GTF file. And also not all the transcript I had first chosen have been counted. To clarify what I mean, look at the example below

# transcript I want to simulate:
    ENST00000456328.2
    ENST00000456330.1
    ENST00000457025.1

# And this is the table of read counts I'm provided with:

    ENST00000456328.2        100
    ENST00000454675.1         50
    ENST00000456330.1          0
    ENST00000545455.2         70
    ENST00000457025.1          0

As you see I have been provided with more transcripts that I wanted to simulate and also the transcripts I wanted to sim have 0 counts but the transcripts I didn't want to simulate have counts of 50 and 70

I think the problem is that I should take a subset of GTF file with the first 100 transcript as the ones I had chosen for simulation and then use this subset of GTF file for obtaining read counts using htseq. Is that correct?

Do I need to have a subset of reference genome (hg38) for alignment of my simulated fasta files to bam files too?

Thanks

hisat htseq polyester RNA-Seq • 2.6k views

ADD COMMENT • link updated 6.3 years ago by GenoMax 141k • written 6.3 years ago by statfa ▴ 760

1

Entering edit mode

If you only want to simulate 100 transcripts then cut your transcript file that number. When polyester generates reads, it is going to do that using some model its code uses (you are a statistician so you should be able to dive into that code and check). Otherwise you have to generate a counts table yourself and THEN ask polyester to generate the reads.

Ideally the simulated reads should align only to those 100 transcripts but some may align elsewhere if you use the full genome/transcriptome when aligning.

ADD REPLY • link 6.3 years ago by GenoMax 141k

0

Entering edit mode

Thank you. No, actually I want to simulate 10000 transcripts. 100 was just an example to clarify my question. I did cut my reference FASTA file but I didn't know if I should cut my gtf file to that number of transcripts when I'm gonna use htseq to obtain counts.

then cut your transcript file that number.

By transcript file you mean the GTF file?

I didn't understand why I should check Polyester's code. Polyester provides me with simulated RNA reads of the transcripts I choose.

ADD REPLY • link 6.3 years ago by statfa ▴ 760

1

Entering edit mode

Since you don't need the GTF file for read simulation I meant trim the original transcript fasta to 100 sequences. You could use a small GTF with only those 100 transcripts when you do the counting to speed things up.

I didn't understand why I should check Polyester's code.

If you want to prevent it from giving you 0 counts for some transcripts. You may need to change polyester code to prevent it from giving you zero counts. As @h.mon said below real datasets will have zero counts for genes so that is how polyester is modeling things.

ADD REPLY • link 6.3 years ago by GenoMax 141k

1

Entering edit mode

Real RNAseq data has plenty of zero-count transcripts. How much will depend on various factors, such as homogeneity or heterogeneity of starting material (e.g cell line vs tissues), library prep kit (e.g. whole RNA vs poly-A selection), and so on. I don't know how Polyester simulates the reads, but if is trying to perform realistic simulations, it will include a sizeable proportion of zero-count transcritps.

ADD REPLY • link 6.3 years ago by h.mon 35k

0

Entering edit mode

This is how polyester works:

"Polyester is an R package designed to simulate RNA sequencing experiments with differential transcript expression.

Given a set of annotated transcripts, Polyester will simulate the steps of an RNA-seq experiment (fragmentation, reverse-complementing, and sequencing) and produce files containing simulated RNA-seq reads. Simulated reads can be analyzed using your choice of downstream analysis tools.

Polyester has a built-in wrapper function to simulate a case/control experiment with differential transcript expression and biological replicates. Users are able to set the levels of differential expression at transcripts of their choosing. This means they know which transcripts are differentially expressed in the simulated dataset, so accuracy of statistical methods for differential expression detection can be analyzed. Polyester offers several unique features:

Built-in functionality to simulate differential expression at the transcript level

Ability to explicitly set differential expression signal strength

Simulation of small datasets, since large RNA-seq datasets can require lots of time and computing resources to analyze

Generation of raw RNA-seq reads, as opposed to alignments or transcript-level abundance estimates

Transparency/open-source code"

As you see this package isn't good for big simulations. So maybe I simulate less transcript (for example 3000 transcripts)

But my main problem now is that when I align the simulated RNA reads FASTA files using hisat and then obtain read counts using htseq, about 70 percentage of my simulated transcripts are counted zero. Which is very disappointing to me. How can I use this simulated count table for my analysis when 70% of transcripts are zero.

And by the way, I defined the number of reads to be a function of transcripts length so longer transcripts would provide more reads.

So that's why I wonder why most of transcripts are counted zero.

ADD REPLY • link 6.3 years ago by statfa ▴ 760

1

Entering edit mode

As you see this package isn't good for big simulations.

No that is not what the quote above says. It is only saying that larger datasets require significant compute resources (as you have already discovered) so polyester is making it possible to simulate small datasets.

Which is very disappointing to me.

So provide a count matrix you like and then have polyester generate reads from it. I am not entirely sure why you are not taking this route.

How can I use this simulated count table for my analysis when 70% of transcripts are zero.

See my comment immediately above. Make up a matrix you like.

ADD REPLY • link 6.3 years ago by GenoMax 141k

0

Entering edit mode

Thank you so much @genomax. That is what I did. I generated a count table and gave it to polyester to generate reads based on that table. For example, the element i,j=200 of this count table would tell polyester to generate 200 reads for transcript i replicate j. And all the elements of the count table have been reasonably large. So that's why I don't expect to be given 0 counts when I obtain the count table of simualted rna reads using downstream analysis including hisat and htseq. Maybe there is something wrong about my pipeline that I don't know.

ADD REPLY • link 6.3 years ago by statfa ▴ 760

1

Entering edit mode

Are you simulating reads that are long enough to allow specific mapping? If reads multi-map then counting programs by default ignore those reads and they are not counted. This is done because in real life you can't be sure of the spot in the genome a read originated from, if it is mapping equally well at more than one spot.

This may be the reason why you are seeing 0 counts in your table. Enable the option in htseq-count (I think that is --nonunique all) to count multi-mapped reads and see if those 0 counts go away.

ADD REPLY • link 6.3 years ago by GenoMax 141k

0

Entering edit mode

I simulated the first 100 transcripts of the reference FASTA. htseq has this options: union, intersection-strict, and intersection-nonempty. As you suggested, I tried the least strict option which is intersection-nonempty. The number of aligned reads has increased but the number of transcripts with zero counts are still the same. I think I should trim the GTF file to those transcripts I chose to simulate.

__no_feature 764

__ambiguous 1012

__too_low_aQual 1

__not_aligned 616

__alignment_not_unique 28245

What do you think?

ADD REPLY • link 6.3 years ago by statfa ▴ 760

1

Entering edit mode

Can you post the exact command you used for htseq-count? I am more familiar with featureCounts (http://subread.sourceforge.net/ ) where the equivalent option to count multi-mapping reads is -M.

ADD REPLY • link 6.3 years ago by GenoMax 141k

0

Entering edit mode

I used htseq in usegalaxy.org. featureCounts? let me give it a try.

ADD REPLY • link 6.3 years ago by statfa ▴ 760

0

Entering edit mode

Wow. Thank you. FeatureCounts sounds to align reads only to my chosen transcripts. How? I don't know. But exactly my chosen transcripts have the counts. Other transcripts are counted zero except one. Now only 20% of my transcripts are counted zero and one unsimulated transcript is counted nonzero.

ADD REPLY • link 6.3 years ago by statfa ▴ 760

0

Entering edit mode

Glad to hear that. Now on to the next step of simulating those 10000 transcripts.