Question

Why are the cdsStart and cdsEnd positions the same in some knownGene records downloaded from UCSC (Table Browser)?

9.2 years ago

mlank001 • 0

I am working on human cancer data and want to extract sequences from the human genome to analyze SNPs. To do that, I am trying to fetch exon start and end positions for human genome build GRCh37. I downloaded this data from the UCSC Table Browser (knownGene table).

Some entries have the same value for cdsStart and cdsEnd. I am confused as to why this is the case, since the CDS is the coding region. Should I ignore this and use the exon start and end positions instead?

Any help is highly appreciated!

Thanks!

UCSC-Genome-Browser • 6.0k views

ADD COMMENT • link updated 23 months ago by Ram 43k • written 9.2 years ago by mlank001 • 0

Please post an example of "The data has the same values for cds start and cds end".

ADD REPLY • link 9.2 years ago by PoGibas 5.1k

Here is an example of one entry:

#name         chrom    strand    txStart    txEnd    cdsStart    cdsEnd    exonCount    exonStarts            exonEnds              proteinID      alignID
uc001aaa.3    chr1     +         11873      14409    11873       11873     3            11873,12612,13220,    12227,12721,14409,                   uc001aaa.3

If you look at the 6th and 7th columns (cdsStart and cdsEnd), the values are the same. I am confused as to why that's the case.

ADD REPLY • link updated 23 months ago by Ram 43k • written 9.2 years ago by mlank001 • 0

Ram · Answer 1 · 2015-02-27

2

9.2 years ago

PoGibas 5.1k

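Answer: This gene doesn't have the usual structure because it is a non-coding RNA (gene type: transcribed_unprocessed_pseudogene). It has no coding region to annotate, which is why cdsStart and cdsEnd hold the same value.

Explanation: Look up the entry in the UCSC browser to find its gene name, DDX11L1, then check the gene type in the GENCODE annotation: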

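# Stream the GENCODE v21 GTF, strip quotes and semicolons, then print the
# gene_type ($14) of the gene record whose gene_name ($18) is DDX11L1.
curl -s "ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gtf.gz" |
    gunzip -c |
    tr -d ';"' |
    awk '($3=="gene" && $18=="DDX11L1") {print $14}'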

transcribed_unprocessed_pseudogene

ADD COMMENT • link updated 23 months ago by Ram 43k • written 9.2 years ago by PoGibas 5.1k

Got it! Thanks a lot, I appreciate it.

So this means I should indeed be looking at the cdsStart and cdsEnd positions when extracting the reference genome sequence to be translated.
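One caveat worth sketching: the span from cdsStart to cdsEnd still contains introns, so the sequence to translate comes from the exon blocks clipped to that span. Below is a minimal awk sketch, assuming the column layout of the table in the question and UCSC's 0-based, half-open table coordinates. `cds_blocks` is a hypothetical helper name and the sample row is invented for illustration, not a real transcript.

```shell
# Print the coding part of each exon as: name, chrom, start, end, strand.
# Assumes tab-separated knownGene columns: 6=cdsStart, 7=cdsEnd,
# 9=exonStarts, 10=exonEnds (comma-separated lists with a trailing comma).
cds_blocks() {
    awk -F'\t' -v OFS='\t' '
        /^#/ || $6 == $7 { next }           # skip header and noncoding rows
        {
            n = split($9, starts, ",")
            split($10, ends, ",")
            for (i = 1; i <= n; i++) {
                if (starts[i] == "") continue   # empty field after trailing comma
                s = (starts[i] + 0 > $6 + 0) ? starts[i] + 0 : $6 + 0   # max(exonStart, cdsStart)
                e = (ends[i] + 0 < $7 + 0) ? ends[i] + 0 : $7 + 0       # min(exonEnd, cdsEnd)
                if (s < e) print $1, $2, s, e, $3
            }
        }
    '
}

# Invented example row: three exons, CDS from 1200 to 4800.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    example.1 chr1 + 1000 5000 1200 4800 3 1000,2000,4500, 1500,2500,5000, |
    cds_blocks
```

For the invented row this prints three blocks (1200-1500, 2000-2500, 4500-4800). Rows with cdsStart equal to cdsEnd, such as uc001aaa.3, produce no output. For minus-strand transcripts the extracted blocks would still need to be reverse-complemented before translation.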

ADD REPLY • link updated 23 months ago by Ram 43k • written 9.2 years ago by mlank001 • 0

It depends. Are you interested only in protein-coding genes? If so, check GENCODE: you can download a list of all protein-coding genes (start and end positions for gene/transcript/UTR/exon/CDS features, plus genomic or translated FASTA sequences).

ADD REPLY • link updated 23 months ago by Ram 43k • written 9.2 years ago by PoGibas 5.1k

score 0 · Answer 2 · 2015-02-28

9.2 years ago

Chirag Nepal ★ 2.4k

In general, when cdsStart and cdsEnd are the same, the transcript is non-coding.
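As a rough sketch of that rule, the awk filter below labels each knownGene row by comparing columns 6 and 7. The function name `label_coding` is a hypothetical helper, and the second sample row is made up to exercise the coding branch; column positions follow the header shown in the question.

```shell
# Label each knownGene row as coding or noncoding.
# Assumes tab-separated input with cdsStart in column 6 and cdsEnd in column 7,
# as in the Table Browser output shown in the question.
label_coding() {
    awk -F'\t' '
        /^#/ { next }                       # skip the header line
        { print $1 "\t" ($6 == $7 ? "noncoding" : "coding") }
    '
}

# Inline sample: the first row is the entry from the question; the second row is
# a made-up coding transcript, included only to exercise the other branch.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    uc001aaa.3 chr1 + 11873 14409 11873 11873 \
    example.1  chr1 + 1000  5000  1200  4800 |
    label_coding
```

On a real download, the same function could be fed the Table Browser file directly, for example `label_coding < knownGene.txt`, where the file name is only a placeholder.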

ADD COMMENT • link 9.2 years ago by Chirag Nepal ★ 2.4k