I am trying to collect known disease-causing mutations, with full genome coordinates and call information, to build a gold standard that I can then search against my own whole-genome data. BED is my target format, so that I can run BEDTools or Galaxy on top of it.
A general comment: _why are BED, GFF, or similar shared formats not supported by public databases as a standard download format?_
With the help of colleagues, I found several sources of disease mutations, including:
- OMIM variants extracted by Omicia and provided as a track (OMICIA_auto) in the next release of the UCSC tables (http://genome-preview.ucsc.edu/...)
- COSMIC rev54 (rev55 as of a couple of days ago), downloaded as a text table that I had to convert to BED with some Perl magic (ftp://ftp.sanger.ac.uk/pub/CGP/cosmic)
- dbSNP was not an easy catch, and I am still struggling to get the full information out of their difficult batch download system ( _only feasible through Ensembl BioMart so far: [tip: the hg18 BioMart is at: http://may2009.archive.ensembl.org/biomart/martview/]_ ). For dbSNP, I searched for records with a phenotype ( _thanks to another colleague_ ). This is the only available annotation for picking out disease variants, but it also includes many association results that are far from causative.
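
For anyone repeating the text-table-to-BED step, here is a minimal Python sketch of that kind of conversion. It is not the original Perl script. It assumes a tab-delimited table with a header row and a position column holding 1-based, inclusive `chrom:start-end` strings. The column names `mutation_id` and `position` are hypothetical placeholders, and the example rows are made up, so adjust them to the actual export.

```python
import csv
import io


def position_to_bed(position, name):
    """Turn a 1-based inclusive 'chrom:start-end' string into a 4-column BED line.

    BED is 0-based and half-open, so start is shifted by one and end is kept.
    A bare 'chrom:pos' is treated as a single-base interval.
    """
    chrom, span = position.split(":")
    start_s, _, end_s = span.partition("-")
    start = int(start_s)
    end = int(end_s) if end_s else start
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom  # UCSC-style chromosome names
    return f"{chrom}\t{start - 1}\t{end}\t{name}"


def table_to_bed(handle, pos_col="position", name_col="mutation_id"):
    """Read a tab-delimited table and return BED lines.

    pos_col and name_col are hypothetical column names; rows without a
    genomic position are skipped.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    lines = []
    for row in reader:
        pos = (row.get(pos_col) or "").strip()
        if not pos:
            continue
        lines.append(position_to_bed(pos, row[name_col]))
    return lines


# Made-up example rows, only to show the input shape.
example = (
    "mutation_id\tposition\n"
    "M1\t1:100-100\n"
    "M2\t\n"
    "M3\tchr2:200-205\n"
)
bed_lines = table_to_bed(io.StringIO(example))
print("\n".join(bed_lines))
```

The off-by-one shift on the start coordinate is the part most worth double-checking against the source table, since a 1-based list treated as 0-based will silently miss every single-base overlap.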
REM: As you may have noticed, I still work with hg18 (Build 36), but more recent data would do as well with some liftOver. If anyone knows of other sources, it would be great to share them, as this is likely a common request from people wanting to mine patient whole genomes.