Why are cdsStart and cdsEnd the same for some knownGene entries downloaded from UCSC? (UCSC Table Browser)
9.2 years ago
mlank001 • 0

I am working on human cancer data and want to extract sequences from the human genome to analyze SNPs. To do that, I am trying to fetch exon start and end positions for human genome build GRCh37. I downloaded this data from the UCSC Table Browser (knownGene table).

Some entries have the same value for cdsStart and cdsEnd. I am confused as to why this is the case, since the CDS is the coding region. Should I ignore this and use the exon start and end positions instead?

Any help is highly appreciated!

Thanks!

UCSC-Genome-Browser • 6.0k views

Please post an example of an entry where cdsStart and cdsEnd have the same value.


Here is the example of one entry:

#na           chrom    strand    txStart    txEnd    cdsStart    cdsEnd    exonCount    exonStarts            exonEnds              proteinID      alignID
uc001aaa.3    chr1     +         11873      14409    11873       11873     3            11873,12612,13220,    12227,12721,14409,    uc001aaa.3

If you look at the 6th and 7th columns (cdsStart and cdsEnd), the values are the same. I am confused as to why that's the case.

9.2 years ago
PoGibas 5.1k

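Answer: This gene doesn't have the usual protein-coding structure. It is annotated as a non-coding transcript (transcribed_unprocessed_pseudogene), so it has no CDS, and UCSC marks that by setting cdsStart equal to cdsEnd.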

Explanation: In the UCSC browser, look up the gene name for this entry: DDX11L1. Its GENCODE gene type can then be checked with:

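# Stream the GENCODE v21 annotation, strip semicolons and quotes so the
# attributes split on whitespace, then print gene_type ($14) for DDX11L1 ($18).
curl -s "ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gtf.gz" |
    gunzip -c |
    tr -d ';"' |
    awk '($3=="gene" && $18=="DDX11L1") {print $14}'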

transcribed_unprocessed_pseudogene

Got it, thanks a lot! I appreciate it.

So this means I should indeed use the cdsStart and cdsEnd positions when extracting the reference genome sequence to be translated.
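For what it's worth, here is a rough sketch of how the coding part of each exon could be derived by clipping the exons to the cdsStart..cdsEnd interval. The function name `coding_exon_parts` and the second example transcript are made up for illustration; the coordinates are treated as 0-based, half-open, which is the convention UCSC tables use.

```shell
# Print the coding part of each exon as "start<TAB>end" (0-based, half-open).
# Prints nothing when cdsStart == cdsEnd, i.e. for non-coding entries.
# Arguments: cdsStart cdsEnd exonStarts exonEnds (comma-separated lists).
coding_exon_parts() {
    awk -v cs="$1" -v ce="$2" -v starts="$3" -v ends="$4" 'BEGIN {
        n = split(starts, s, ",")
        split(ends, e, ",")
        for (i = 1; i <= n; i++) {
            if (s[i] == "") continue          # trailing comma leaves an empty item
            lo = (s[i] + 0 > cs + 0) ? s[i] + 0 : cs + 0   # max(exonStart, cdsStart)
            hi = (e[i] + 0 < ce + 0) ? e[i] + 0 : ce + 0   # min(exonEnd, cdsEnd)
            if (lo < hi) print lo "\t" hi     # keep only non-empty overlaps
        }
    }'
}

# Entry from this thread (cdsStart == cdsEnd): prints nothing.
coding_exon_parts 11873 11873 "11873,12612,13220," "12227,12721,14409,"

# Made-up coding transcript: prints 150-200 and 300-450.
coding_exon_parts 150 450 "100,300," "200,500,"
```

For transcripts on the "-" strand, the pieces would still need to be joined and reverse-complemented before translation.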


It depends. Are you interested only in protein-coding genes? If so, check the GENCODE genes: you can download a list of all protein-coding genes (start and end positions for gene/transcript/UTR/exon/CDS, plus genomic or translated FASTA sequences).
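As a hedged sketch of that filtering step, the function below pulls the names of protein-coding genes out of a GENCODE GTF read from stdin. The function name `protein_coding_genes` is a placeholder, and it assumes the attribute keys `gene_type` and `gene_name` that GENCODE GTFs use (other sources may name them differently).

```shell
# Print gene_name for every protein-coding gene record in a GTF read from stdin.
# Attributes are looked up by key rather than by fixed field number, so the
# result does not depend on the order of the attributes in column 9.
protein_coding_genes() {
    awk -F'\t' '
        $3 == "gene" && $9 ~ /gene_type "protein_coding"/ {
            if (match($9, /gene_name "[^"]+"/)) {
                # drop the 11-character prefix (key, space, quote) and the closing quote
                print substr($9, RSTART + 11, RLENGTH - 12)
            }
        }
    '
}

# Possible usage on a locally downloaded annotation file (file name assumed):
#   gunzip -c gencode.v21.annotation.gtf.gz | protein_coding_genes | sort -u
```

The resulting list can then be used to keep only the matching rows of the table you downloaded.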

9.2 years ago
Chirag Nepal ★ 2.4k

In general, if the CDS start equals the CDS end, the transcript is noncoding.
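A minimal sketch of that rule, assuming the Table Browser output is tab-separated with the column order shown in the question (cdsStart in column 6, cdsEnd in column 7). The function name `label_coding` is a placeholder of mine.

```shell
# Label each knownGene row read from stdin as "coding" or "noncoding".
# Assumes tab-separated columns in the order shown in the question:
# name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, ...
label_coding() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                                    # skip the header line
        { print $1, ($6 == $7 ? "noncoding" : "coding") }
    '
}

# The entry from the question (cdsStart == cdsEnd == 11873) is labeled noncoding.
printf 'uc001aaa.3\tchr1\t+\t11873\t14409\t11873\t11873\t3\t11873,12612,13220,\t12227,12721,14409,\t\tuc001aaa.3\n' | label_coding
```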
