Looking for clean list of gene names for UCSC's HG19
8.4 years ago
kynnjo ▴ 70

I'm trying to make sense of some data and analysis I've inherited. It's a big mess. The results include some gene names that are clearly corrupted ("Excel genes", etc.).

Therefore, for starters, I want to determine a "clean" list of the "gene universe" that was used for the analysis. All I have been able to ascertain is that the reference gene data used came from UCSC's HG19 build.

By "clean" I mean that the list I'm looking for should

  • be complete
  • contain no synonyms
  • consist of identifiers belonging to one and only one system of nomenclature

Of course, the ideal would be a simple file, published by UCSC, consisting of one gene name per line, no duplicates, but after much searching I have not been able to find such a file.

Does such a file exist? If not, what is the closest thing I could find?

Thanks in advance!

8.4 years ago
Ram 43k

You can use the method suggested by Gian, or query UCSC's public MySQL server.

The query, run against the hg19 database:

SELECT DISTINCT geneSymbol from kgXref kx;

or, directly from the command line:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'SELECT DISTINCT geneSymbol from kgXref'
8.4 years ago
Lemire ▴ 940

Download the hg19 refFlat table:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz

then take the unique gene names from its first column:

zcat refFlat.txt.gz | cut -f 1 | sort -u
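
Once you have that list, one way to spot the corrupted names in the inherited results is to compare the two lists with comm. This is only a rough sketch: the function name missing_genes and the file mine.txt (your inherited gene names, one per line) are made up for illustration, and it assumes the analysis really used refFlat gene names rather than some other UCSC table.

```shell
# Hypothetical helper: print names from an inherited list that do not
# appear in the refFlat-derived gene universe.
#   $1: refFlat.txt.gz as downloaded from UCSC
#   $2: inherited gene list, one name per line (assumed file layout)
missing_genes() {
    universe_file=$(mktemp)
    # Column 1 of refFlat is the gene name; sort in the C locale so that
    # comm sees both inputs in the same byte order.
    gzip -dc "$1" | cut -f 1 | LC_ALL=C sort -u > "$universe_file"
    # comm -23 keeps only the lines unique to the first input (stdin here).
    LC_ALL=C sort -u "$2" | LC_ALL=C comm -23 - "$universe_file"
    rm -f "$universe_file"
}
```

Usage would be something like `missing_genes refFlat.txt.gz mine.txt`. Anything it prints is a name that refFlat does not know about, which should include the Excel-mangled ones, but also any name that came from a different nomenclature.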