I've already looked here and in other forums, but couldn't find the answer to my question.
I want to design baits for a target enrichment Sequencing approach and have the output of a MarkerMiner search for orthologous loci from four different genomes with A. thaliana as a Reference as. These output alignments are separate Fasta-Files for each A. thaliana annotated gene with the sequences from my datasets aligned to it. I have already run a script to filter out those loci supported to be orthologous by at least two of my four input datasets.
However, now, I'm stumped. My alignments are gappy, since the input data is mostly RNAseq whereas the Reference contains the introns as well. So it looks like this :
AT01G1234567
ATCGATCGATGCGCGCTAGCTGAATCGATCGGATCGCGGTAGCTGGAGCTAGSTCGGATCGC
MyData1
--CGATGCGCGC-----------CGGATCGCGG---------------CGGATCGC
MyData2
--CGCTGCGCGC------------GGATAGCGG---------------CGGATCCC
To effectively design baits I now need to extract all the aligned parts from the file, so that I will end up with separate files; or separate alignments within a new file; for the parts that are aligned between MyData and the Reference sequence with all the gappy parts excluded. There are about 1300 of these fasta files, so doing it manually is no option.
I have a bit of programming experience in python and with Linux command line tools, however I am completely lost on how to go about this. I would appreciate a hint, on what kind of tools are out there I could use or what kind of algorithm I need to come up with.
I thought about a position guided approach; i.e. defining the position on the reference sequence, where the aligned parts of myData starts and ends and then extracting that. But I dont know how to iterate over the alignment and determining if it is aligned or not. Any help is greatly appreciated.
Thank you.
Cheers