Question

GO Annotation of fasta sequences

0

Entering edit mode

5.2 years ago

dthorbur ★ 1.9k

I am conducting a genome-wide scan for balancing selection. There are several levels of my analysis, and I now have several thousand 7-70kb windows that met the threshold, and have since extracted a sequence (from the reference genome) that represent each genomic region potentially under selection.

My next step would be to identify which genes are present in each of these windows, and then once I have a gene list, I can do a GO Enrichment analysis. However, I used Blast+ to identify what genes are present, and the output doesn't give a Gene Stable ID, instead giving an accession version number (bold below; I've also shortened the query name here). How would I get a list of genes from this? I can write a script to parse the .tsv's no problem, but it's using "AC144488.2" as a search term that confuses me as I can't do this individually.

Seq1 AC144488.2 94.127 1226 29 8 16442 17638 103536 102325 0.0 1825

Seq2 AC144488.2 95.726 936 24 9 16709 17637 25817 24891 0.0 1493

Seq3 AC144488.2 97.496 639 16 0 23259 23897 210217 209579 0.0 1092

Maybe I've misunderstood, but to use programs like Panther and TopGO, I already need gene ID's, or converted AA sequences. Since I am searching for what genes are present in these sequences, I don't have a gene list yet.

I'm happy to clarify any of my points, and would greatly appreciate any and all guidance.

Thanks

GO gene annotation annotation • 1.4k views

ADD COMMENT • link 5.2 years ago by dthorbur ★ 1.9k

1

Entering edit mode

Why don't you use the annotation for the reference genome which will contain the gene IDs? I.e., instead of extracting the sequences for the windows of interest, do a bedtools intersect or something like that with the annotation file for your reference genome.

ADD REPLY • link 5.2 years ago by Friederike 8.9k

0

Entering edit mode

Thanks for the idea - I hadn't thought of approaching the issue like that. I'll find an annotation track for my reference genome and see what I manage to pull out.

ADD REPLY • link 5.2 years ago by dthorbur ★ 1.9k

score 1 · Accepted Answer · 2019-02-22

1

Entering edit mode

5.2 years ago

dthorbur ★ 1.9k

Whilst the bedtools intesect idea sounds like it would work. I found an alternative method that appears to have a little more power I am writing in R - I can quickly identify which gene, transcript, and even which exons are in any of my ~5,000 windows. This extra power might be useful later one to put some of our results into a biological context, for example if we detect selection on the hypervariable region of MHC, rather than more conserved regions of the gene.

I have downloaded all the gene, transcript, and exon locations across every chromosome in my reference genome from ENSEMBL Biomart, Then it's not hard to test whether a genes are present in my windows, including the ability to test whether the whole gene is within my window or not.

ADD COMMENT • link 5.2 years ago by dthorbur ★ 1.9k

0

Entering edit mode

Sounds absolutely valid, great to see you've found something that helps you move forward! Maybe you can post a minimal example of your solution here so that other biostar users with the same question may have a pointer in the future.

Good luck with the analysis!

ADD REPLY • link 5.2 years ago by Friederike 8.9k

1

Entering edit mode

So my genomic windows is in a table that looks like this (I have removed a bunch of columns for simplicity):

Chr, Population, Highest_Peak_Start, Highest_Peak_End, Cluster_Tajima, Peak_Tajima,

2, River, 14890502, 14901001, 2.197490841, 2.297172321

4, River, 12831002, 12841501, 2.079510541, 2.094626411

The Ensembl Biomart table looks something like (I've taken transcript and exon ID and locations out here, but you can easily include them if relevant):

St_Gene_ID, Gene_Start, Gene_End, Gene_Name, Chrom

ENSGACG00000022427, 2723122, 2723190, RF00069, 14

ENSGACG00000022906, 27633556, 27633684, RF00618, 1

Th R script got a little long - so if you want it message me - as it does a little more than just tests whether there is overlap between the two datasets. It's simply a matter of subsetting the Biomart data set based on the window (Highest_Peak_x in my table) start and end.

ADD REPLY • link 5.1 years ago by dthorbur ★ 1.9k