Distribution Of Gene Size
2
4
Entering edit mode
12.0 years ago
Ying W ★ 4.2k

I was plotting things today and I found a funny peak in the distribution of gene size and I was wondering if any of you have encountered something similar.

gene size distribution

This is what I did:

  1. download iGenomes UCSC hg19
  2. under Annotation/genes there is a genes.gtf
  3. take all exons and create bed file with most 3' exon location and most 5' exon location
  4. plot geneend-genestart from the bed file created at step 3 (image above)

I used log scale since there is a big tail, I wanted to be able to visualize it better. The peak around 2^25 is ~28.5Mb and I do see quite a few genes with that size. I was thinking I might have done step3 wrong but I couldn't find anything wrong w/the code

The code for plotting is as follows

ee = read.table('exons.bed', sep="\t")

hist(log2(ee[,3]-ee[,2]), breaks=1000, main="log length of gene body")
gene • 7.2k views
ADD COMMENT
11
Entering edit mode
12.0 years ago
brentp 24k

Looks like something with your annotations. Here's the same thing grabbing sizes from UCSC:

mysql --user=genome -N --host=genome-mysql.cse.ucsc.edu -A -D hg19  -e "select txEnd - txStart from knownGene"  > sizes.txt

Then plotting in R:

hist(log2(read.table('sizes.txt')$V1), breaks=1000)

gives

enter image description here

ADD COMMENT
1
Entering edit mode

thanks, i need to learn the ucsc tables one of these days.

My problem was that I did not notice that the gtf file from iGenomes included the alternative chromosomes and genes that map to them, thus I was combining exons in genes across different chromosome and that gene_ids are not unique

ADD REPLY
0
Entering edit mode

Out of interest, anyway to represent the x-axis as number of basepairs?

I'm trying to see a distribution of genes and the lengths (measured in base pairs)

ADD REPLY
0
Entering edit mode

If you use base pairs instead of log(base pairs) as x-axis, the plot will be skewed to the right and less informative.

ADD REPLY
0
Entering edit mode
12.0 years ago
Wen.Huang ★ 1.2k

could it be duplicate entries of the same gene (but different isoforms, etc) in your table?

ADD COMMENT
0
Entering edit mode

I don't think so, I collapsed the gtf file by gene name so I only have one entry per gene. I looked at the top/bottom 50 genes listed and there did not seem to be a pattern in the name. I'm actually thinking now that it looks that way based on how I plotted it. ~200 genes out of 20k genes or ~1% of human genes fall in a very narrow window of gene size

ADD REPLY

Login before adding your answer.

Traffic: 1932 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6