I am conducting a genome-wide scan for balancing selection. There are several levels of my analysis, and I now have several thousand 7-70kb windows that met the threshold, and have since extracted a sequence (from the reference genome) that represent each genomic region potentially under selection.
My next step would be to identify which genes are present in each of these windows, and then once I have a gene list, I can do a GO Enrichment analysis. However, I used Blast+ to identify what genes are present, and the output doesn't give a Gene Stable ID, instead giving an accession version number (bold below; I've also shortened the query name here). How would I get a list of genes from this? I can write a script to parse the .tsv's no problem, but it's using "AC144488.2" as a search term that confuses me as I can't do this individually.
Seq1 AC144488.2 94.127 1226 29 8 16442 17638 103536 102325 0.0 1825
Seq2 AC144488.2 95.726 936 24 9 16709 17637 25817 24891 0.0 1493
Seq3 AC144488.2 97.496 639 16 0 23259 23897 210217 209579 0.0 1092
Maybe I've misunderstood, but to use programs like Panther and TopGO, I already need gene ID's, or converted AA sequences. Since I am searching for what genes are present in these sequences, I don't have a gene list yet.
I'm happy to clarify any of my points, and would greatly appreciate any and all guidance.
Thanks
Why don't you use the annotation for the reference genome which will contain the gene IDs? I.e., instead of extracting the sequences for the windows of interest, do a
bedtools intersect
or something like that with the annotation file for your reference genome.Thanks for the idea - I hadn't thought of approaching the issue like that. I'll find an annotation track for my reference genome and see what I manage to pull out.