Mapping GRCh38 to ENSEMBL / UCSC: Gene, transcript, cDNA and Protein IDs and Sequence
7.2 years ago
inambioinfo ▴ 110

Hello Folks,

I am interested in mapping the latest genome assembly, GRCh38, to other standard databases such as ENSEMBL, UCSC and RefSeq.

As of now, I have GRCh38 in hand. I would like to know which files I need from the above databases to map the complete genome to the corresponding gene IDs and sequences, coding (CDS) IDs and sequences, and protein IDs and sequences.

Eventually, I want the following mappings:

1. Mapping: Chromosome co-ordinates ---> Gene ID, Gene Sequence and Gene Name

2. Mapping: Gene ID, Gene Sequence and Gene Name ---> CDS ID & Seq / Exon ID & Seq / Start & End of CDS

3. Mapping: CDS ID & Seq / Exon ID & Seq / Start & End of CDS ---> Protein ID & Seq

I also want to use dbSNP and COSMIC to identify variations in Protein Seq / Exon Seq / Gene Seq / Chromosome co-ordinates.

I have already checked the information from ENSEMBL and learned that this is possible with BioMart if I work in R or Bioconductor. But I would prefer to do it manually and program it locally to get the mapping data mentioned above.

Is there a single source of information, like a GTF file, from which I can draw the whole mapping? I would be grateful to hear of any interlinking or correlation among the 3 databases (ENSEMBL, UCSC, NCBI) that would help me map gene, CDS and protein in any of the 3.

More detailed suggestions will be appreciated. Thanks in advance for your response.

RNA-Seq SNP genome Mappping ENSEMBL • 4.9k views
7.2 years ago
Ben_Ensembl ★ 2.4k

Hello,

Yes, these files look right for the analysis you wish to perform. With regard to the two GTF files, the README should clear this up for you: ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens//README

.gtf: this is the default file, it should contain the full annotation for all species except human and mouse. For human and mouse, it will contain all annotation on the primary assembly, ie excluding patch and haplotype regions

.chr.gtf: contains only annotation on chromosomes, so toplevel scaffolds are excluded (patch and haplotypes are not included)
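For local processing, the GTF alone already links gene, transcript and protein IDs. Below is a minimal Python sketch of one way to pull those links out; it assumes the usual nine tab-separated GTF columns, attributes written as key "value"; pairs, and a protein_id attribute on CDS rows. The function names (parse_attributes, open_text, link_ids) are made up for this illustration and are not part of any Ensembl tool.

```python
import gzip
import re
from collections import defaultdict

# Matches attribute pairs such as: gene_id "ENSG00000139618";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict (for repeated keys the last value wins)."""
    return dict(ATTR_RE.findall(field))


def open_text(path):
    """Open a plain or gzip-compressed GTF file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def link_ids(lines):
    """Collect gene -> transcripts, transcript -> protein and gene -> name links."""
    gene_to_transcripts = defaultdict(set)
    transcript_to_protein = {}
    gene_names = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip blank or malformed rows
        feature = cols[2]
        attrs = parse_attributes(cols[8])
        gene_id = attrs.get("gene_id")
        transcript_id = attrs.get("transcript_id")
        if gene_id and "gene_name" in attrs:
            gene_names[gene_id] = attrs["gene_name"]
        if gene_id and transcript_id:
            gene_to_transcripts[gene_id].add(transcript_id)
        # Assumption: protein IDs are only looked for on CDS rows.
        if feature == "CDS" and transcript_id and "protein_id" in attrs:
            transcript_to_protein[transcript_id] = attrs["protein_id"]
    return gene_to_transcripts, transcript_to_protein, gene_names
```

A typical call would be along the lines of: with open_text("Homo_sapiens.GRCh38.87.gtf.gz") as handle: genes, proteins, names = link_ids(handle). The start and end columns (cols[3] and cols[4]) on the same rows give the exon and CDS co-ordinates if you need them.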

For the list of variants, I would suggest using the Variant Effect Predictor: http://www.ensembl.org/info/docs/tools/vep/index.html

The VEP is an online tool that will allow you to retrieve the genomic co-ordinates of each of the variants, along with mappings to genes as well as cDNA and protein sequences.
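To give a feel for the co-ordinate arithmetic involved, here is a rough Python sketch of converting a genomic position into a CDS position and a residue number. It is only an illustration under simplifying assumptions (1-based inclusive co-ordinates, a complete CDS starting at phase 0, CDS segments already collected for one transcript); the function names are hypothetical, and the VEP handles many cases this does not.

```python
def genomic_to_cds(pos, cds_segments, strand):
    """Return the 1-based CDS position of a genomic position, or None if outside the CDS.

    pos          -- 1-based genomic position
    cds_segments -- list of (start, end) genomic intervals, inclusive, in any order
    strand       -- "+" or "-"
    """
    # Walk the segments in transcription order: ascending on "+", descending on "-".
    ordered = sorted(cds_segments, reverse=(strand == "-"))
    offset = 0  # coding bases contained in the segments already passed
    for start, end in ordered:
        if start <= pos <= end:
            within = pos - start if strand == "+" else end - pos
            return offset + within + 1
        offset += end - start + 1
    return None


def cds_to_protein(cds_pos):
    """Return (1-based residue number, 0-based position within the codon)."""
    return (cds_pos - 1) // 3 + 1, (cds_pos - 1) % 3
```

For example, with CDS segments (100, 105) and (200, 205) on the forward strand, genomic position 200 is CDS position 7, which is the first base of codon 3.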

Best wishes

Ben

7.2 years ago
Ben_Ensembl ★ 2.4k

Hello,

As I work within the Ensembl team, I'll answer your question from an Ensembl point of view. You may want to get advice from others regarding how to link the information between the different resources.

Although we don't have a web-based tool available for this sort of analysis, many of the queries you wish to perform 'manually' can be done using our REST API (rest.ensembl.org).

The particular endpoints that will be relevant for your 3 mappings are in the 'Mappings' and 'Overlap' sections:

e.g: Genomic co-ordinates to gene ID: http://rest.ensembl.org/documentation/info/overlap_region

You may want to explore each of the GET or POST endpoints to see which ones suit the query you wish to perform.
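As a starting point, a query against the overlap endpoint could look roughly like the Python sketch below, using only the standard library. The helper names are made up for this example, and the JSON field names read in summarise_genes ("id", "external_name", "start", "end", "strand") are an assumption about the response that you should check against a live result.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

SERVER = "https://rest.ensembl.org"


def overlap_url(region, species="human", feature="gene"):
    """Build the URL for GET overlap/region/:species/:region."""
    query = urlencode({"feature": feature})
    # Keep ":" unescaped so the region stays readable, e.g. 7:140424943-140624564
    return f"{SERVER}/overlap/region/{quote(species)}/{quote(region, safe=':')}?{query}"


def fetch_overlap(region, species="human", feature="gene"):
    """Call the REST server (needs network access) and return the decoded JSON list."""
    request = Request(
        overlap_url(region, species, feature),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        return json.load(response)


def summarise_genes(records):
    """Reduce each record to (ID, name, start, end, strand).

    Assumption: the field names below match the server response; a missing
    field simply becomes None.
    """
    return [
        (r.get("id"), r.get("external_name"), r.get("start"), r.get("end"), r.get("strand"))
        for r in records
    ]
```

Calling summarise_genes(fetch_overlap("7:140424943-140624564")) would then list the genes overlapping that region. Passing feature="transcript", "cds" or "exon" instead should return the other feature types, if I recall the documentation correctly.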

Finally, you can download the GTF file (as well as many other files containing dumps from our databases for all species available in Ensembl) from our FTP site: http://www.ensembl.org/info/data/ftp/index.html

Best wishes

Ben Ensembl Helpdesk

7.2 years ago
inambioinfo ▴ 110

Hi Ben,

Thanks for your information.

I am curious about a few things and would like to clarify them.

Which GTF file from the ENSEMBL database do I need to use for the mapping:

http://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.gtf.gz

or

http://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.chr.gtf.gz

For the mapping I mentioned earlier, I plan to use the following files:

DNA : http://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/ ....................... (All Chromosomes.fa)

cDNA : http://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

CDS : http://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz

Peptide : http://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz

Variation : http://ftp.ensembl.org/pub/release-87/variation/vcf/homo_sapiens/Homo_sapiens.vcf.gz

Clinical Variation : http://ftp.ensembl.org/pub/release-87/variation/vcf/homo_sapiens/Homo_sapiens_clinically_associated.vcf.gz

If I want to map from genome co-ordinates all the way to the protein/peptide FASTA sequence, do you think I can use these files to start my work?

I would also like to know how I can correlate a variant/dbSNP ID with its exact position in the genome, cDNA, CDS or protein.

Thanks once again Ben.
