Average intron length in Drosophila melanogaster
6.7 years ago
EVR ▴ 610

Hi All,

What is the average or median intron length in the Drosophila melanogaster genome? Is it 65 nt? Kindly guide me.

Thanks in advance.

Drosophila Intron_length • 2.9k views
6.7 years ago
Malcolm.Cook ★ 1.5k

A little R/Bioconductor code will tell you the median is 102 and the mean is 1609:

> library(TxDb.Dmelanogaster.UCSC.dm6.ensGene)
> library(GenomicRanges)
> i<-intronsByTranscript(TxDb.Dmelanogaster.UCSC.dm6.ensGene)
> i<-unlist(i)
> summary(width(i))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      2      63     102    1609     742  257022

Thanks for the code. Do you know how to get these statistics per transcript?

For instance, the average intron length per transcript?


I'm not sure what interest these summaries are to you, but you can call summary on the widths of each element of the GRangesList returned by intronsByTranscript() (that is, i before the unlist() call), like this:

t(do.call(cbind,lapply(i,function(i) summary(width(i)))))

        Min.   1st Qu.   Median         Mean   3rd Qu.   Max.
1         76     76.00     76.0     76.00000     76.00     76
2         76     76.50     77.0     77.00000     77.50     78
3        112    112.00    112.0    112.00000    112.00    112
4         56     56.00     56.0     56.00000     56.00     56
...
6.7 years ago

Untested, but I think it should work, in theory:

Retested with the updated link from genomax:

$ wget -qO- ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/fasta/dmel-all-intron-r6.16.fasta.gz | gunzip -c - | awk '{ />/ && ++a || b += length(); } END { print b/a; }' 
1639.2

The awk one-liner is pulled from: Code Golf: Mean Length Of Fasta Sequences
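The golfed one-liner reports only the mean. A more readable sketch that also gives the median could look like the following; the function names `fasta_lengths` and `mean_median` are made up for illustration, and it assumes a plain FASTA stream on stdin (sequences may span several lines).

```shell
# fasta_lengths (hypothetical helper): read FASTA on stdin and
# print one sequence length per record, summing wrapped lines.
fasta_lengths() {
  awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
       { len += length($0) }
       END { if (seen) print len }'
}

# mean_median (hypothetical helper): read one number per line and
# print "mean<TAB>median"; the input is sorted first so the median
# can be read from the middle of the array.
mean_median() {
  sort -n | awk '{ v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) exit 1
      median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      printf "%.1f\t%s\n", sum / NR, median
    }'
}
```

With the file downloaded locally, this would be used as `gunzip -c dmel-all-intron-r6.16.fasta.gz | fasta_lengths | mean_median`.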


Works when ftp is added before flybase.net in the URL above.

$ wget -qO- ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/fasta/dmel-all-intron-r6.16.fasta.gz | gunzip -c - | awk '{ />/ && ++a || b += length(); } END { print b/a; }'
1639.2
6.7 years ago
aka001 ▴ 190

You can start by looking at the UCSC table browser and choose your gene model. Then just take all the intron lengths from the results and get the median.
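As a rough sketch of that second step, assuming the Table Browser results were saved as tab-separated BED12 (whole-gene output, where column 10 is the block count and columns 11 and 12 hold the comma-separated exon sizes and starts), intron lengths are the gaps between consecutive blocks. The function name `bed12_intron_lengths` is hypothetical.

```shell
# bed12_intron_lengths (hypothetical helper): read BED12 on stdin and
# print one intron length per line. An intron is the gap between the
# end of exon i and the start of exon i+1; single-exon rows are skipped.
bed12_intron_lengths() {
  awk -F '\t' '$10 > 1 {
    split($11, size, ",")    # blockSizes
    split($12, start, ",")   # blockStarts, relative to chromStart
    for (i = 1; i < $10; i++)
      print start[i + 1] - (start[i] + size[i])
  }'
}
```

The lengths can then be piped to any median tool, for example `bed12_intron_lengths < genes.bed | datamash median 1` (here `genes.bed` is a placeholder file name). Note that an intron shared by several transcripts is counted once per transcript with this approach.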

6.7 years ago

Download the intron FASTA file from ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/fasta/dmel-all-intron-r6.16.fasta.gz, as posted by Alex Reynolds, and run the following command:

zgrep \> dmel-all-intron-r6.16.fasta.gz | awk -F '[;=]' '{print $14}'  | datamash mean 1 median 1

output:

1639.2038965026 98

1639.2038965026 is the mean and 98 is the median. Datamash is a command-line GNU tool available in most distribution repositories.


without awk:

 zgrep  \> dmel-all-intron-r6.16.fasta.gz | cut -d '=' -f8 | tr -d ' ;' | datamash -s mean 1 median 1

with seqkit:

seqkit stats dmel-all-intron-r6.16.fasta.gz

output:

file                                            format  type  num_seqs      sum_len  min_len  avg_len  max_len
../example_data/dmel-all-intron-r6.16.fasta.gz  FASTA   DNA     71,654  117,455,516        2  1,639.2  268,107
6.4 years ago

Download the source file dmel-all-intron-r6.18.fasta.gz from ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/fasta.

Per-gene intron median and mean length:

$ zgrep \> dmel-all-intron-r6.18.fasta.gz | awk -F '[;=]' '{print $6,$14}' | awk -F '[, ]' -v OFS="\t" '{print $1,$NF}' | datamash -s -g1 median 2 mean 2

output (tail):

FBgn0285950  155   155
FBgn0285952  151   263
FBgn0285954  177   611.2
FBgn0285955  3692  6778.4117647059
FBgn0285958  71    68
FBgn0285962  69    1625.1111111111
FBgn0285963  880   3596.4705882353
FBgn0285970  563   563
FBgn0285971  58    58
FBgn0285991  59.5  147.5

Per-transcript intron mean and median length (in that order, as requested from datamash):

$ zgrep \> dmel-all-intron-r6.18.fasta.gz | awk -F '[;=]' '{print $6,$14}' | cut -f2- -d"," | awk -F " "  '{split($1,a,",");for(i in a)print a[i]"\t"$2}'| datamash -si -g1 mean 2 median 2

output (tail):

FBtr0472917  177.6            60
FBtr0472918  216.33333333333  206
FBtr0472919  55               55
FBtr0472920  129.5            129.5
FBtr0472921  227.33333333333  202
FBtr0472922  387.5            387.5
FBtr0472923  863              863
FBtr0472955  209.33333333333  73
FBtr0472956  87.666666666667  73
FBtr0472957  65               63

More information: datastat is available in most Linux repositories. The output here has the mean (average), standard deviation, 1st quartile, median, minimum, maximum and number of introns per transcript (in that order). Unfortunately, the output order is hard-coded.

$ zgrep \> dmel-all-intron-r6.18.fasta.gz | awk -F '[;=]' '{print $6,$14}' | cut -f2- -d"," | awk -F " "  '{split($1,a,",");for(i in a)print a[i]"\t"$2}'| datastat --cnt --dev --1qt --med --min --max  -k 1  | head
# avg dev 1qt 2qt min max cnt
FBtr0005088 272.333 297.499 61 64 59 701 6
FBtr0006151 254.2 246.737 65 70 51 659 5
FBtr0070000 406.75 676.554 63 69 60 2109 8
FBtr0070003 103 50 53 103 53 153 2
FBtr0070006 1144.8 1180.26 284 361 108 3171 5
FBtr0070007 92.3333 33.7672 68.5 71 66 140 3
FBtr0070008 70.3333 11.8977 62 64 60 87 3
FBtr0070025 275.333 137.679 217 352 82 392 3
FBtr0070026 57 0 57 57 57 57 1