How to get number of exons for each transcript in biomart
5.2 years ago

I would like to get a table from Ensembl that includes the number of exons for each transcript. I can't find any option for this in BioMart, though.

biomart • 4.3k views

I don't think BioMart stores such aggregate data. If you are not tied to Ensembl, you should be able to write a custom query against the UCSC MySQL tables. If you need Ensembl, you might need to get the CDS information and count exons yourself.


In addition to RamRS's suggestion: if you can convert the GTF to exons, for example using this, you can groupBy the transcript and count the number of exons per transcript.


If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
5.2 years ago
Benn 8.3k

I wouldn't use biomaRt, but try AWK instead. If you download the annotation GTF file from Ensembl, you can try something like this:

awk '$3=="exon" { count[$10]++ } END { for (word in count) print word, count[word] }' Homo_sapiens.GRCh38.78.gtf > numberOfExonsPerGene.txt
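As far as I can tell, whitespace field 10 in an Ensembl GTF is the gene_id value, so the command above tallies exon lines per gene (as its output file name says). For per-transcript counts, here is a minimal sketch that looks up `transcript_id` by name instead of by position. The function name `count_exons_per_transcript` is made up for illustration, and it assumes the attribute column uses the usual `transcript_id "ENST...";` style.

```shell
# Count exon records per transcript_id in a GTF file (path given as $1).
# Assumption: column 9 contains an attribute written as transcript_id "ID";
count_exons_per_transcript() {
    awk -F '\t' '
        $3 == "exon" {
            # Find the attribute by name, since its position in column 9
            # can differ between annotation releases.
            if (match($9, /transcript_id "[^"]+"/)) {
                # Skip the 15-character prefix and the closing quote.
                id = substr($9, RSTART + 15, RLENGTH - 16)
                count[id]++
            }
        }
        END { for (id in count) print id "\t" count[id] }
    ' "$1" | sort
}
```

It could then be called as `count_exons_per_transcript Homo_sapiens.GRCh38.78.gtf > numberOfExonsPerTranscript.txt` (the output file name is just an example).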
5.2 years ago

Using UCSC (not BioMart):

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306 -D hg38 -e 'select chrom,name,exonCount from wgEncodeGencodeBasicV28;'
+-------+-------------------+-----------+
| chrom | name              | exonCount |
+-------+-------------------+-----------+
| chr1  | ENST00000619216.1 |         1 |
| chr1  | ENST00000473358.1 |         3 |
| chr1  | ENST00000469289.1 |         2 |
| chr1  | ENST00000607096.1 |         1 |
| chr1  | ENST00000417324.1 |         3 |
| chr1  | ENST00000641515.2 |         3 |
| chr1  | ENST00000335137.4 |         1 |
| chr1  | ENST00000466430.5 |         4 |
| chr1  | ENST00000495576.1 |         2 |
| chr1  | ENST00000610542.1 |         4 |
| chr1  | ENST00000493797.1 |         2 |
| chr1  | ENST00000484859.1 |         2 |
| chr1  | ENST00000466557.6 |         8 |
| chr1  | ENST00000410691.1 |         1 |
| chr1  | ENST00000496488.1 |         2 |
| chr1  | ENST00000612080.1 |         1 |
| chr1  | ENST00000635159.1 |         2 |
| chr1  | ENST00000426406.3 |         1 |
(...)
+-------+-------------------+-----------+
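The `name` column above carries versioned IDs (for example `ENST00000619216.1`). If the counts need to be joined against unversioned Ensembl transcript IDs, a small filter like the following sketch could strip the suffix. The function name `strip_transcript_version` is hypothetical, and it assumes tab-separated input with the transcript ID in the first column, such as `mysql` produces with `-B -N`.

```shell
# Remove a trailing ".<digits>" version suffix from the first column
# of tab-separated input read from stdin.
strip_transcript_version() {
    awk -F '\t' -v OFS='\t' '{ sub(/\.[0-9]+$/, "", $1); print }'
}

# Possible usage (needs network access to the public UCSC server):
# mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306 -D hg38 -B -N \
#     -e 'select name,exonCount from wgEncodeGencodeBasicV28;' | strip_transcript_version
```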
5.2 years ago

Or use R with a transcript database (you can make your own from any GTF using the makeTxDbFromGFF function from the GenomicFeatures package):

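## load the packages (GenomicFeatures provides exonsBy)
library(GenomicFeatures)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)

## setup transcriptDb
txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene
## get exon locations for each gene
## (use exonsBy(txdb, by = "tx", use.names = TRUE) to group by transcript instead)
exons <- exonsBy(txdb, by = "gene")

## print number of exons for each gene - just look at the top 6 with 'head'
head(sapply(exons, length))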
ENSMUSG00000000001 ENSMUSG00000000003 ENSMUSG00000000028 ENSMUSG00000000031 
                 9                  9                 24                 15 
ENSMUSG00000000037 ENSMUSG00000000049 
                41                 16
5.2 years ago

I cannot see a way either to get the number of exons in each transcript directly via BioMart.

But you can choose Exon stable ID as an output attribute and then use a simple awk script to count the exons per transcript. Let's assume you have chosen the attributes Gene stable ID, Transcript stable ID and Exon stable ID for the output. Make sure you tick the option "Unique results only" and download the results as a TSV.

$ awk -v FS="\t" -v OFS="\t" 'NR>1 {transcript[$2]++;} END { for(t in transcript) print t, transcript[t] }' mart_export.txt

This builds an associative array keyed by the transcript ID in column 2 and increments the count for every line with that ID. At the end we iterate over the array and print each count. If the transcript ID is in a different column in your output, change the $2 accordingly.
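If you would rather not hard-code the column number, a variant could find the column from the header line. This is only a sketch: the function name `count_exons_from_mart` is made up, and it assumes the header cell reads exactly `Transcript stable ID`, which may differ between BioMart versions.

```shell
# Count rows per transcript in a BioMart TSV export (path given as $1),
# locating the transcript column by its header text.
# Assumption: the header cell is exactly "Transcript stable ID".
count_exons_from_mart() {
    awk -F '\t' -v OFS='\t' -v want="Transcript stable ID" '
        NR == 1 {
            # Scan the header for the wanted column name.
            for (i = 1; i <= NF; i++) if ($i == want) col = i
            if (!col) { print "column not found: " want > "/dev/stderr"; exit 1 }
            next
        }
        { n[$col]++ }
        END { for (t in n) print t, n[t] }
    ' "$1" | sort
}
```

For example, `count_exons_from_mart mart_export.txt` prints one transcript ID and its row count per line. As with the one-liner, the count equals the exon count only if each transcript/exon pair appears once, hence "Unique results only".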

fin swimmer


Thank you. This worked perfectly for what I needed to do.

4.5 years ago
asrmpr • 0

Exon Number Finder v.1

A tool to find genes of user-specific exon number

https://github.com/CyPH3R-ASR/exonNum


Thanks for contributing, but please note:

1) this does not answer the top-level question, as the OP asked about BioMart, and

2) it is sufficient if you add your tool once, not as an answer and a comment. Removed the comment.
