Question

Does STAR give the reads that were mapped at intergenic regions? How can I get the regions and count reads?

6.3 years ago

valopes ▴ 30

Hi,

I was wondering whether STAR also maps reads to intergenic regions. If it does, are those reads included in the mapped-read counts? How can I get the position information and the read counts for them?

Thanks in advance.

RNA-Seq • 5.4k views

updated 6.3 years ago by Antonio R. Franco ★ 5.1k • written 6.3 years ago by valopes ▴ 30

score 2 · Answer 1 · 2018-01-04

was wondering if Star does the intergenic mapping as well?

Yes, it maps reads wherever they align. In the stats, it counts mapped reads irrespective of location. The annotated splice sites are used as 'hints' only; annotated and unannotated splices are reported separately in the Log.final.out file (see the example log at the end of this answer).

How can I get the position info and the reads counts for it?

If you need the number of reads mapping to annotated transcripts, you should use a tool like featureCounts or htseq-count on your BAM files. If you have a novel transcript you want to test, you can insert it into the GFF/GTF file before aligning/counting (if the new gene has no introns, you can skip re-indexing the genome and re-aligning; most likely re-counting alone will be fine anyway).

If you have a cDNA sequence, you could use GMAP to generate the GFF record for your new gene by aligning the sequence to the genome, and then insert that record into the annotation file.

If you are only interested in a few regions, you can use samtools view or samtools tview.
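To illustrate what counting alignments in a single region involves, here is a minimal Python sketch that works on plain SAM text. It is not how samtools is implemented; `count_reads_in_region` and the toy records are made up for illustration, and the only assumption is the standard SAM column layout (FLAG, RNAME, POS and CIGAR in columns 2, 3, 4 and 6).

```python
import re

# CIGAR operations; M, D, N, = and X consume reference bases.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")


def count_reads_in_region(sam_lines, chrom, region_start, region_end):
    """Count mapped SAM records overlapping chrom:region_start-region_end.

    Coordinates are 1-based and inclusive. This counts alignment records,
    so a multi-mapping read can contribute more than one record.
    """
    count = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 0x4 or rname != chrom or cigar == "*":
            continue  # unmapped, or on another chromosome
        ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
        end = pos + ref_len - 1
        # The alignment span includes introns (N), so a spliced read whose
        # intron covers the region is counted as overlapping here.
        if pos <= region_end and end >= region_start:
            count += 1
    return count


# Toy records (hypothetical, not real data).
toy_sam = [
    "@SQ\tSN:chr1\tLN:10000",
    "r1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t180\t255\t20M100N30M\t*\t0\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r4\t0\tchr2\t120\t255\t50M\t*\t0\t0\t*\t*",
]
n_overlapping = count_reads_in_region(toy_sam, "chr1", 140, 200)
```

For real data you would run this on the text output of the BAM file, or simply let samtools do the region lookup on an indexed BAM.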

                              Started job on |       Sep 14 13:54:30
                         Started mapping on |       Sep 14 13:54:30
                                Finished on |       Sep 14 13:56:42
   Mapping speed, Million of reads per hour |       150.09

                      Number of input reads |       5503468
                  Average input read length |       76
                                UNIQUE READS:
               Uniquely mapped reads number |       5035623
                    Uniquely mapped reads % |       91.50%
                      Average mapped length |       75.64
                   Number of splices: Total |       554208
        Number of splices: Annotated (sjdb) |       439322
                   Number of splices: GT/AG |       537158
                   Number of splices: GC/AG |       1407
                   Number of splices: AT/AC |       137
           Number of splices: Non-canonical |       15506
                  Mismatch rate per base, % |       0.40%
                     Deletion rate per base |       0.02%
                    Deletion average length |       1.55
                    Insertion rate per base |       0.01%
                   Insertion average length |       1.28
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |       321253
         % of reads mapped to multiple loci |       5.84%
    Number of reads mapped to too many loci |       0
         % of reads mapped to too many loci |       0.00%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |       0.00%
             % of reads unmapped: too short |       2.66%
                 % of reads unmapped: other |       0.00%
                              CHIMERIC READS:
                   Number of chimeric reads |       0
                        % of chimeric reads |       0.00%
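If you want these numbers programmatically, the log is easy to parse. The following is only a sketch that assumes the "name | value" layout shown above; `parse_star_log` is a hypothetical helper, not something shipped with STAR.

```python
def parse_star_log(text):
    """Parse the 'name | value' lines of a STAR Log.final.out into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNIQUE READS:' carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


# A few lines copied from the log above.
example_log = """\
                      Number of input reads |       5503468
               Uniquely mapped reads number |       5035623
                   Number of splices: Total |       554208
        Number of splices: Annotated (sjdb) |       439322
    Number of reads mapped to multiple loci |       321253
"""

stats = parse_star_log(example_log)

# Splices that were not in the annotation supplied at indexing time.
total_splices = int(stats["Number of splices: Total"])
annotated_splices = int(stats["Number of splices: Annotated (sjdb)"])
unannotated_splices = total_splices - annotated_splices
print(unannotated_splices)  # 114886

# Mapped reads (unique plus multi-mapping), irrespective of genomic location.
mapped_reads = (int(stats["Uniquely mapped reads number"])
                + int(stats["Number of reads mapped to multiple loci"]))
print(mapped_reads)  # 5356876
```

Note that none of these figures separate genic from intergenic reads; for that you still need to count against an annotation.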

score 1 · Answer 2 · 2018-01-04

6.3 years ago

Hussain Ather ▴ 990

If you want to count reads in intergenic regions with STAR, you can add fake exons that represent the intergenic space to your GTF file. You could leave ~100 b of spacing between the genes and these fake exons to allow for some overlap. (source: https://groups.google.com/forum/#!topic/rna-star/W77DYn4Cosk)
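A rough Python sketch of how such fake exons could be generated for one chromosome. Everything here is an assumption for illustration: the function name `intergenic_gtf_lines`, the "fake" source label, the `intergenic_*` gene IDs, and putting the fake exons on the "+" strand (which only makes sense for unstranded counting).

```python
def intergenic_gtf_lines(chrom, chrom_length, genes, spacing=100):
    """Return GTF 'exon' lines covering the gaps between genes on one chromosome.

    genes: list of (start, end) tuples, 1-based inclusive; they may overlap.
    spacing: number of bases left free next to each real gene.
    """
    lines = []
    cursor = 1  # first base not yet covered by a gene
    index = 0
    # A sentinel just past the chromosome end closes the last gap.
    sentinel = (chrom_length + 1, chrom_length + 1)
    for start, end in sorted(genes) + [sentinel]:
        # Only pad on sides that touch a real gene, not at chromosome ends.
        gap_start = cursor + (spacing if cursor > 1 else 0)
        gap_end = start - 1 - (spacing if start <= chrom_length else 0)
        if gap_end >= gap_start:
            index += 1
            name = f"intergenic_{chrom}_{index}"
            attrs = f'gene_id "{name}"; transcript_id "{name}";'
            lines.append("\t".join(
                [chrom, "fake", "exon", str(gap_start), str(gap_end), ".", "+", ".", attrs]
            ))
        cursor = max(cursor, end + 1)
    return lines


# Toy coordinates (hypothetical): two overlapping genes and one isolated gene.
fake_exons = intergenic_gtf_lines("chr1", 10000, [(1000, 2000), (1500, 3000), (5000, 6000)])
for line in fake_exons:
    print(line)
```

The resulting lines would be appended to a copy of the real annotation, so that each intergenic block is counted like a single-exon gene.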


score 1 · Answer 3 · 2018-01-04

STAR maps across the whole genome and places the reads where they derive from. If you have reads from intergenic regions, they will be mapped there.

I can think of several approaches.

FIRST ONE: Map to the whole reference genome. What you need is a reference GTF or GFF file containing the coordinates of your intergenic regions. You can then use a program like htseq-count (under Linux) or featureCounts (also available in R through Rsubread) to extract the counts for these regions. How hard it is to get such a reference depends on whether you are working with a model organism that is included in BioMart, because with BioMart it is very easy to get any desired region of your genome.

If your organism is not included in BioMart, you need to edit the GTF or GFF file to get rid of the CDS and intron regions, and then use the resulting file with htseq-count. I would consider bedtools for this, even though I have never attempted it myself. There are programs that convert a GFF or GTF file to a BED file, and bedtools can then give you the coordinates of the intergenic regions.
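The GTF-to-BED step is mostly a coordinate shift, and a small Python sketch may make it clearer. It assumes the GTF has "gene" feature lines with a `gene_id` attribute (not every GTF does), and `gtf_genes_to_bed` is just an illustrative name. The sorted output is the kind of input that bedtools complement expects, together with a chromosome sizes file.

```python
def gtf_genes_to_bed(gtf_text):
    """Convert the 'gene' features of a GTF into sorted BED lines.

    Output columns: chrom, start, end, name.
    """
    records = []
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep whole genes only, so introns stay inside the interval
        chrom, start, end, attributes = fields[0], int(fields[3]), int(fields[4]), fields[8]
        # Pull gene_id out of the attribute column; fall back to "." if absent.
        name = "."
        for attr in attributes.split(";"):
            attr = attr.strip()
            if attr.startswith("gene_id "):
                name = attr.split(" ", 1)[1].strip('"')
        # GTF is 1-based inclusive, BED is 0-based half-open: only the start shifts.
        records.append((chrom, start - 1, end, name))
    # Plain lexicographic sort by chromosome, then by start; the chromosome
    # order must match the sizes file given to bedtools.
    records.sort()
    return [f"{c}\t{s}\t{e}\t{n}" for c, s, e, n in records]


# Toy annotation (hypothetical records).
toy_gtf = (
    'chr1\tsrc\tgene\t5000\t6000\t.\t+\t.\tgene_id "g2";\n'
    'chr1\tsrc\texon\t5000\t5500\t.\t+\t.\tgene_id "g2";\n'
    '#a comment line\n'
    'chr1\tsrc\tgene\t1000\t2000\t.\t-\t.\tgene_id "g1"; gene_name "A";\n'
)
bed_lines = gtf_genes_to_bed(toy_gtf)
```

Taking the complement of these gene intervals against the chromosome sizes then yields the intergenic coordinates.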

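SECOND ONE: Map reads to transcripts only (to the CDS with transcripts). This will give you two sets of reads: those that map to the transcripts and those that do not. The second set should be enriched for intergenic reads, although it will also contain intronic and otherwise unmappable reads. Since unmapped reads carry no coordinates, you would still need to align them to the genome to get their positions.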

EDIT: see this