Question

Convert intron coordinates to gene names

0

Entering edit mode

4.7 years ago

waqaskhokhar999 ▴ 160

I have a huge list of intron coordinates present in the Arabidopsis thaliana genome, I want to get the gene names based on intron coordinates.

Sampe itron coordinates:

chr start   end
1   20259962    20260494
1   20634981    20635060
1   18880866    18881011

Desired output:

chr start   end Name
1   20259962    20260494    AT1G54270
1   20634981    20635060    AT1G55320
1   18880866    18881011    AT1G50950

This thread (Find Out The Genes That Correspond To My Coordinates) is very close but I wasn't able to get the desired output.

RNA-Seq sqtl • 1.8k views

ADD COMMENT • link updated 4.7 years ago by Alex Reynolds 35k • written 4.7 years ago by waqaskhokhar999 ▴ 160

0

Entering edit mode

but I wasn't able to get the desired output.

What exactly went wrong? Did you try all the solutions in that thread?

ADD REPLY • link 4.7 years ago by Ram 43k

1

Entering edit mode

4.7 years ago

Pierre Lindenbaum 161k

use bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html to build a BED file of intron + name.

$ wget -q  -O - "ftp://ftp.ensemblgenomes.org/pub/release-44/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.44.gtf.gz" |\
gunzip -c |\
java -jar dist/bioalcidaejdk.jar -F GTF -e 'stream().flatMap(G->G.getTranscripts().stream()).flatMap(T->T.getIntrons().stream()).forEach(I->println(I.getContig()+"\t"+(I.getStart()-1)+"\t"+I.getEnd()+"\t"+I.getTranscript().getGene().getGeneName()));' |\
sort -t $'\t' -k1,1 -k2,2n | uniq > introns.bed

$ head introns.bed
1   3913    3995    NAC001
1   4276    4485    NAC001
1   4605    4705    NAC001
1   5095    5173    NAC001
1   5326    5438    NAC001
1   7069    7156    ARV1
1   7232    7383    ARV1
1   7450    7563    ARV1
1   7649    7761    ARV1
1   7649    8235    ARV1

then use bedtools intersect to get the intersection with your files.

ADD COMMENT • link 4.7 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

4.7 years ago

ATpoint 82k

If you have a GTF file simply filter for entries with the "gene" feature and then use bedtools closest which will return 0 as distance in case of an overlap. From there you can awk out the format you want. It is unfortunate you have no strand information, so you will have multiple genes sometimes per intron.

ADD COMMENT • link 4.7 years ago by ATpoint 82k

score 3 · Accepted Answer · 2019-08-22

You could get gene annotations for your assembly of Arabidopsis and convert them to BED with BEDOPS gff2bed:

$ wget -qO- https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff | awk '$3=="gene"' | gff2bed - > TAIR10.genes.bed

With TAIR10.genes.bed in hand, you can map your coordinates to these genes and get their ID.

Here's an example of ad-hoc region being mapped with BEDOPS bedmap:

$ echo -e 'Chr1\t20259962\t20260494' | bedmap --echo --echo-map-id --delim '\t' - TAIR10.genes.bed 
Chr1    20259962    20260494    AT1G54270

Notice the difference in chromosome names, here using the TAIR convention. To do multiple regions, you'll need to take your regions and get their chromosome names "TAIR-compatible" for querying, as well as sorting them with BEDOPS sort-bed:

$ tail -n+2 introns.txt | awk -vFS="\t" -vOFS="\t" '{ print "Chr"$0 }' | sort-bed - > introns.bed

Then you can map the sorted, BED-formatted introns:

$ bedmap --echo --echo-map-id --delim '\t' introns.bed TAIR10.genes.bed > answer.bed

If an intron overlaps multiple genes, then the ID column will contain a comma-delimited list of those gene names.

Otherwise, there will be either one name (for one overlap) or nothing (in the case of no overlaps). If you want to filter out introns that do not overlap a gene annotation, one can add the --skip-unmapped option:

$ bedmap --echo --echo-map-id --delim '\t' --skip-unmapped introns.bed TAIR10.genes.bed > answer.bed