How To Extract Snps From A Desired List Of Gene Names
3
2
Entering edit mode
12.5 years ago
Kevin ▴ 640

hi

If I had the genomic coords, I can get the intersection of SNPs and my gene list. But right now I just have my gene list.

What would be your method to extract SNPs that fall within a desired gene list?

Cheers

disease snp • 5.0k views
ADD COMMENT
6
Entering edit mode
12.5 years ago
Chris Penkett ▴ 490

EnsEMBL Perl API (this is largely from the courses they give!):

use strict;
use warnings;

use Bio::EnsEMBL::Registry;

my $reg = 'Bio::EnsEMBL::Registry';
$reg->load_registry_from_db(-host => 'ensembldb.ensembl.org', -user => 'anonymous');

my $gene_name = shift;

my $ga = $reg->get_adaptor('Human', 'Core', 'Gene');
my $sa = $reg->get_adaptor('Human', 'Core', 'Slice');
my $vfa = $reg->get_adaptor('Human', 'Variation', 'VariationFeature');

my $genes = $ga->fetch_all_by_external_name($gene_name);

while (my $gene = shift @{$genes}) {

  my $chr   = $gene->seq_region_name;
  my $start = $gene->seq_region_start;
  my $end   = $gene->seq_region_end;

  my $length = $end - $start + 1;

  my $region = sprintf "%s:%d-%d", $chr, $gene->start, $gene->end;

  print join("\t", ($gene->stable_id, $region, $length, $gene->external_name, $gene->description) ), "\n";

  my $slice = $sa->fetch_by_region('chromosome', $chr, $start, $end);

  my @vfs = @{$vfa->fetch_all_by_Slice($slice)};

  for my $vf (@vfs) {
    print
      $vf->variation_name, ' has alleles ', $vf->allele_string,
      ' located at ', $slice->seq_region_name, ':', 
      $vf->seq_region_start, '-', $vf->seq_region_end, "\n";
  }
}

% perl get_snps_gene_name.pl IL4
ENSG00000113520    5:132009678-132018368    8691    IL4    interleukin 4 [Source:HGNC Symbol;Acc:6014]
rs2070874 has alleles C/T located at 5:132009710-132009710
rs2243251 has alleles A/G located at 5:132009787-132009787
rs4986964 has alleles T/C located at 5:132009821-132009821
rs56279116 has alleles G/A located at 5:132010172-132010172
rs55743996 has alleles C/G located at 5:132010191-132010191
rs56141757 has alleles G/A located at 5:132010216-132010216
rs71645907 has alleles C/G located at 5:132010423-132010423
rs71645908 has alleles C/T located at 5:132010490-132010490
rs113312579 has alleles A/T located at 5:132010563-132010563
rs71645909 has alleles A/G located at 5:132010573-132010573
rs2243252 has alleles T/C located at 5:132010588-132010588
rs2079102 has alleles T/G located at 5:132010606-132010606
rs734244 has alleles C/T located at 5:132010726-132010726
rs2243253 has alleles C/T located at 5:132010761-132010761
rs4996002 has alleles T/G located at 5:132010764-132010764
rs71645910 has alleles C/T located at 5:132010822-132010822
rs71645911 has alleles G/A located at 5:132010826-132010826
rs11479198 has alleles C/- located at 5:132011166-132011166
...

It can also give you consequences of the SNP within the gene (intronic, non-synonymous, etc), phenotypes, 1000 genome population info, etc.

e.g., 1000K population allele frequencies:

my $aa  = $reg->get_adaptor($species, 'variation', 'allele');

my $alleles = $aa->fetch_all_by_Variation($var);

for my $a (@$alleles) {
  if (defined $a->population and $a->allele eq $$vars{$name}{alt_base}) {
    if ($a->population->name eq '1000GENOMES:pilot_1_CEU_low_coverage_panel') {
      $allele_freq_string .= 'CEU:' . sprintf("%.2f", 100*$a->frequency) . ';';
    } elsif ($a->population->name eq '1000GENOMES:pilot_1_CHB+JPT_low_coverage_panel') {
      $allele_freq_string .= 'CHB+JPT:' . sprintf("%.2f", 100*$a->frequency) . ';';
    } elsif ($a->population->name eq '1000GENOMES:pilot_1_YRI_low_coverage_panel') {
      $allele_freq_string .= 'YRI:' . sprintf("%.2f", 100*$a->frequency) . ';';
    }
  }
}

print $allele_freq_string . "\n";

Trait/disease association:

my $vaa = $reg->get_adaptor($species, 'variation', 'variationannotation');

foreach my $var_anno (@{$vaa->fetch_all_by_Variation($var)}) {
  next if not defined $var_anno->phenotype_description;
  $var_pheno_string .= $var_anno->phenotype_description;
  if (defined $var_anno->p_value) {
    $var_pheno_string .= ' (' . $var_anno->p_value . ')';
  } 
  if (defined $var_anno->associated_variant_risk_allele) {
    $var_pheno_string .= ' [risk allele: ' . $var_anno->associated_variant_risk_allele . ']';
  } 
  $var_pheno_string .= ';';
} 

print $var_pheno_string . "\n";

Chris

ADD COMMENT
4
Entering edit mode
12.5 years ago

Use Biomart.

  • Go to Biomart Central
  • Select "Ensembl Variation" as Dataset, and the organism of your interest
  • Open the "Filters" tab and define the genes
  • Open the "Attributes" tab and define the info you need (e.g. SNP ID)

For more complex searchers, you can do this in two steps. For example, you can first retrieve the coordinates of the genomic regions of interest using the code Ensembl Genes dataset, and then use these results to query Ensembl Variation.

To automatize these steps, you can also have a look at Galaxy.

ADD COMMENT
4
Entering edit mode
12.5 years ago

You can use the public UCSC mysql server. E.g:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e '

select
 X.geneSymbol,
 S.name,
 S.chrom,
 S.chromStart

from
snp132 as S,
knownGene as K,
kgXref as X
where
S.chrom=K.chrom and 
X.kgId=K.name and
not(txEnd<=S.chromStart or S.chromEnd<=K.txStart)  and
 X.geneSymbol in ("EIF2A","EIF4G1")'


geneSymbol    name    chrom    chromStart
EIF2A    rs73869253    chr3    150264619
EIF2A    rs114580236    chr3    150264679
EIF2A    rs16862731    chr3    150264706
EIF2A    rs17280260    chr3    150264778
EIF2A    rs115840267    chr3    150264789
(...)
ADD COMMENT

Login before adding your answer.

Traffic: 1514 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6