Amino acid sequence from CDS information of gff file
4.9 years ago
doubleA • 0

Hi all, I have a somewhat basic question.

My inputs are the reference genome (hg38) and a VCF file for my sample.

I extracted the CDS regions of a gene from the hg38 GFF file.

For example,

128937678-128937818
128936538-128936649
128935037-128935044
128937678-128937818
128936538-128936649
128935591-128935700
128937678-128937818

After that, I extracted the consensus sequence of each CDS region using my sample VCF file.

Ultimately, I want to get the amino acid sequence.

I wonder whether the nucleotide sequences of the CDS regions extracted above can be concatenated and translated into an amino acid sequence.

For example,

128937678-128937818 -> GAAGTG
128936538-128936649 -> GAGGCATCTCTGA
128935037-128935044 -> GAGCGAG
128937678-128937818 -> ATCTTCGG
128936538-128936649 -> CCTTCGATG
128935591-128935700 -> TTGACAACATCT
128937678-128937818 -> AGCATTTCCTC
Combination -> GAAGTGGAGGCATCTCTGAGAGCGAGATCTTCGGCCTTCGATGTTGACAACATCTAGCATTTCCTC -> Convert to amino acid sequence

Can I get the amino acid sequence like this?
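
The join-and-translate step described above could be sketched in Python roughly as follows. This is only a minimal illustration under several assumptions: the segments are given on the forward reference strand, they all belong to a single transcript, and translation starts in phase 0. The names `translate_cds` and `reverse_complement` and the `(start, sequence)` input layout are hypothetical, chosen for this example. For a gene on the minus strand, the joined sequence has to be reverse-complemented before translation.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate_cds(segments, strand="+"):
    """Join (start, forward_strand_sequence) segments and translate them.

    Assumes all segments belong to one transcript and that translation
    starts in phase 0. Stop codons become '*', codons containing
    ambiguous bases become 'X', and a trailing incomplete codon is ignored.
    """
    ordered = sorted(segments)  # ascending genomic start
    cds = "".join(seq.upper() for _, seq in ordered)
    if strand == "-":
        # Minus-strand genes are read from the opposite strand.
        cds = reverse_complement(cds)
    codons = (cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```

Called as `translate_cds([(start1, seq1), (start2, seq2)], strand="-")`, the function sorts the segments by genomic position itself, so the order in which they are listed does not matter. Segments from different transcripts must not be mixed in one call.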

Amino acid CDS gff • 4.1k views

If the GFF file is correctly formatted, try gffread with -y, which writes a protein FASTA file with the translation of the CDS for each record:

$ gffread -y proteins.fa -g Homo_sapiens.GRCh38.dna.chromosome.1.fa Homo_sapiens.GRCh38.96.chromosome.1.gff3
$ head proteins.fa
>transcript:ENST00000641515 gene=OR4F5
MKKVTAEAISWNESTSETNNSMVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIVITVVSDSHLH
SPMYFLLANLSLIDLSLSSVTAPKMITDFFSQRKVISFKGCLVQIFLLHFFGGSEMVILIAMGFDRYIAI
CKPLHYTTIMCGNACVGIMAVTWGIGFLHSVSQLAFAVHLLFCGPNEVDSFYCDLPRVIKLACTDTYRLD
IMVIANSGVLTVCSFVLLIISYTIILMTIQHRPLDKSSKALSTLTAHITVVLLFFGPCVFIYAWPFPIKS
LDKFLAVFYSVITPLLNPIIYTLRNKDMKTAIRQLRKWDAHSSVKF.
>transcript:ENST00000335137 gene=OR4F5
MVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIVITVVSDSHLHSPMYFLLANLSLIDLSLSSVT
APKMITDFFSQRKVISFKGCLVQIFLLHFFGGSEMVILIAMGFDRYIAICKPLHYTTIMCGNACVGIMAV
TWGIGFLHSVSQLAFAVHLLFCGPNEVDSFYCDLPRVIKLACTDTYRLDIMVIANSGVLTVCSFVLLIIS

Do you have to do this for only a few CDSs or for many of them?

If only a few: copy-paste the DNA sequence into a translation tool (e.g. EMBOSS transeq).

Your example, however, does not really look like a valid CDS (it does not start with an ATG, for instance).
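
A quick sanity check along these lines could look like the sketch below. It assumes the sequence is already joined and in coding orientation, and it only tests the usual properties of a complete CDS (ATG start, length divisible by 3, terminal stop codon, no in-frame internal stop). The function name `cds_problems` is hypothetical, and a partial CDS annotation can legitimately fail some of these checks.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(seq):
    """Return a list of reasons why seq does not look like a complete CDS."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not seq.startswith("ATG"):
        problems.append("does not start with ATG")
    # Split into complete in-frame codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    ends_with_stop = (
        len(seq) % 3 == 0 and bool(codons) and codons[-1] in STOP_CODONS
    )
    if not ends_with_stop:
        problems.append("does not end with a stop codon")
    # The terminal stop is expected; any other in-frame stop is a problem.
    internal = codons[:-1] if ends_with_stop else codons
    if any(codon in STOP_CODONS for codon in internal):
        problems.append("contains an in-frame internal stop codon")
    return problems
```

An empty list means none of these checks failed; it does not prove that the annotation or the reading frame is correct.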


This is theoretically not too difficult to do, but I'm guessing since these are discontinuous ranges, they've had exons removed?

How do you define where the first real CDS starts and ends, and where the next one begins, if all of your data looks like that?
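
One way to keep the segments apart is to group the CDS rows by the `Parent` attribute in column 9 of a GFF3 file, which names the transcript each CDS segment belongs to. The sketch below assumes standard tab-separated GFF3 lines; the function name `cds_by_transcript` is hypothetical. Segments are returned in translation order, that is ascending start for the plus strand and descending start for the minus strand.

```python
from collections import defaultdict


def cds_by_transcript(gff3_lines):
    """Group CDS features by Parent and order them for translation.

    Returns {parent_id: [(start, end, phase), ...]}.
    """
    groups = defaultdict(list)
    strands = {}
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue  # keep only well-formed CDS rows
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        start, end = int(fields[3]), int(fields[4])
        strand, phase = fields[6], fields[7]
        # A feature may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                groups[parent].append((start, end, phase))
                strands[parent] = strand
    return {
        parent: sorted(cds, reverse=(strands[parent] == "-"))
        for parent, cds in groups.items()
    }
```

With the segments grouped this way, a coordinate range that appears under several transcripts is no longer ambiguous, because each copy stays attached to its own parent.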
