Biostar Beta. Not for public use.
Get intronic and intergenic sequences based on gff file
Entering edit mode
2.5 years ago
biolab ♦ 1.1k

Dear all

I have a gff3 file with header shown below. The 3rd column indicates feature (exon, intron, UTR),the 4th and 5th columns indicate start and stop position, and the last column indicates detailed feature (eg which exon). I aim to extract intergenic and intronic sequences based on this file.

Note many genes overlap.How to achieve my goal?

Chr1 MSU_osa1r7 gene 2903 10817 . + . ID=LOC_Os01g01010;Name=LOC_Os01g01010;Note=TBC%20domain%20containing%20protein%2C%20expressed
Chr1 MSU_osa1r7 mRNA 2903 10817 . + . ID=LOC_Os01g01010.1;Name=LOC_Os01g01010.1;Parent=LOC_Os01g01010
Chr1 MSU_osa1r7 exon 2903 3268 . + . ID=LOC_Os01g01010.1:exon_1;Parent=LOC_Os01g01010.1
... ...
GFF • 8.0k views
Entering edit mode

Hello biolab!

Questions similar to yours can already be found at:

We have closed your question to allow us to keep similar content in the same thread.

If you disagree with this please tell us why in a reply below. We'll be happy to talk about it.


Entering edit mode

Questions like biolab's get asked again and again in many different versions. I think my answer below was specific to his exact questions (or at least one interpretation of it) which is not posed in your link.... and gave him specific advice on how to proceed. I certainly would not have answered as I did to that other question which was in some ways broader.

Entering edit mode

The question I linked to answers a very similar (albeit more general) question. I'm typically in favor of pointing users to these so that useful answers don't get spread across different threads.

However, I think I misread the question when I closed it: I thought it said intronic and exonic sequences, rather than intergenic sequences. I agree that inferring the intergenic sequence adds a dimension to this question that warrants a separate thread, so I'll reopen.

Entering edit mode

Hi Daniel, thank you very much for the link provided.

Entering edit mode
3 months ago
ATpoint 17k

Question is a few years old, but still relevant:

Intergenic is the easiest, as it is simply the complement of all features in a GTF/GFF file with the rest of the genome.

Note: This definition of intergenic is based purely on the GFF file entries. Of course there are promoters and other regulatory regions in the genome, with e.g. promoters right upstream of the first exon, but this is not annotated in a typical GFF so here, intergenic is simply the complement of everything in the GFF. If one wants to include promoters, one could define a certain window upstream of the gene start coordinate.

Requires the GFF/GTF and a BED file with the chromosome sizes. The latter, you can get based on the chromSizes files for your genome, e.g. from UCSC. It contains two columns, the first is the chromosome name, the second the number of basepairs on that chr. Make a BED file out of it:

awk 'OFS="\t" {print $1, "0", $2}' chromSizes.txt | sort -k1,1 -k2,2n > chromSizes.bed

Sort the GFF file:

cat in.gff | awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k4,4n -k5,5n"}' > in_sorted.gff

Get intergenic regions:

bedtools complement -i in_sorted.gff -g chromSizes.bed > intergenic_sorted.bed

Next, intron is the complement of intergenic and exonic regions. Note: This definition of intron is (IMHO) sufficient, because the other features of a GFF (CDS, UTR, Start/Stop Codon) are all sub-intervals of "exon". Hence, everything that is not intergenic or exon must be intron. First, extract exonic coordinates in BED format:

awk 'OFS="\t", $1 ~ /^#/ {print $0;next} {if ($3 == "exon") print $1, $4-1, $5}' in_sorted.gff > exon_sorted.bed

Now use BEDtools complement to get the introns:

bedtools complement -i <(cat exon_sorted.bed intergenic_sorted.bed | sort -k1,1 -k2,2n) -g chromSizes.bed > intron_sorted.bed

This of course can be customized based on the information in your GFF. Again, as mentioned above, if you want intergenic to be more precise, you could add custom features such as promoters, enhancers or other regulatory elements to the GFF, and then again take the complement of this file with the rest of the genome.

Edit 08/18: just learned that there is a nice functionality in R to get introns. Check this post.

Entering edit mode
15 months ago
Malcolm.Cook ♦ 1.0k
kansas, usa

Well, your problem is underspecified, since you don't say how you want to handle overlapping genes or transcripts.

However, in your toolkit you might want to have bedtools.

For example, if by "intergenic" you want all intervals that are not spanned by any gene, then you will want to use pull out just the gene entries from your gff3 (iand then use bedtools complement which "returns all intervals in a genome that are not covered by at least one interval in the input BED/GFF/VCF file." (

You can then get the fasta for the resulting gff using bedtools getfasta which" extracts sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file" (

Similarly, if you require that intronic regions must be introns in each and every transcript which spans it, you can extract your 'gene' and 'intron' features into separate gene.gff and intron.gff files, respectively, then use bedtools merge on each to create genic.gff and intronic.gff, and then use bedtools subtract to subtract the genic regions from the intronic regions.

If your definition of intronic and intergenic are otherwise, you'll probably find a way to compute them using bedtools.

Entering edit mode

Hi, Malcolm, thanks a lot for your helpful suggestions!

Entering edit mode

Can you tell me how to put all intron sequences to a single file (specify each intron of one gene) using a genome sequence and a GTF/GFF file? Thank you very much.


Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3