Greetings all,
I'm currently working on a molecular ecological study and have a fasta file containing both alleles for 50 loci from 8 individuals. Some of the programs I want to use require a separate fasta file for each locus, and I'm trying to find a way to group the data into files based on their catalog locus ID.
My data currently looks like this:
>CLocus_2857_Sample_8_Locus_15367_Allele_0 [Test_1]
AATTCGCGGTGGGGCTCTACAGGCAGCAGAATCCCTTCAGCACCCAGCCCAGGGCTGCCCTGGAGAAGGTCTGGATGTGCAGTGAATGAGATGGGGCCACAAGAAATGTGAGCTGAAGTCACGGGATGGATCCTCAGGCTGC
>CLocus_2857_Sample_8_Locus_15367_Allele_1 [Test_1]
AATTCGCGGTGGGGCTCTACAGGCAGCAGAATCCCTTCAGCACCCAGCCCAGGGCTGCCCTGGAGAAGGTCTGGATGTGCAGTGAATGAGATGGGGCCACAAGAAATGTGAGCTGAAGTCACGGGATGGATCCTCAGGCTGC
>CLocus_2886_Sample_0_Locus_62236_Allele_0 [Test_2]
AATTCAGTGTGGTGGTCTTCCTGGACTGGGTCACGGCCTTTTTTTGTGGGATGCACGTGTGCTTTGTGTGTTTGTGTGTGACCAAAAGCTAAATTAATTGGAAAATGAGTCTGTACTGTTTTGCAAATATGTTAAATGATGT
>CLocus_2886_Sample_0_Locus_62236_Allele_1 [Test_2]
AATTCAGTGTGGTGGTCTTCCTGGACTGGGTCACGGCCTTTTTTTGTGGGATGCACGTGTGCTTTGTGTGTTTGTGTGTGACCAAAAGCTAAATTAATTGGAAAATGAGTCTGTACTGTTTTGCAAATATGTTAAATGATGT
The number in 'CLocus_X' is the catalog locus ID, and that is what I want to group the records by.
I think the answer lies with using the awk tool, but my attempts so far have been clumsy and unsuccessful - I'm new to scripting, and stumped. I've found similar answers on here, but nothing that handles .fasta files with the same syntax.
Any advice would be hugely appreciated! Thank you all for your time.
It doesn't look like FASTA: there is only one line per record, containing both the name AND the sequence (?)
Apologies, there should be two lines per record, one for the name and one for the sequence - it looked normal when I pasted it into the box. I'll edit it to properly reflect the format.
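With the two-line format confirmed, one way to do the grouping is sketched below in Python rather than awk. It is a minimal sketch, not a tested pipeline: it assumes every header starts with `>CLocus_<number>_` as in the example records, and the function name `split_fasta_by_locus`, the output directory `by_locus`, and the output file names `CLocus_<number>.fasta` are made up for illustration. It holds all records in memory, which should be fine for 50 loci from 8 individuals.

```python
import os
import re
from collections import defaultdict

# Header pattern assumed from the example records: ">CLocus_<id>_Sample_..."
LOCUS_RE = re.compile(r"^>CLocus_(\d+)_")


def split_fasta_by_locus(fasta_path, out_dir="by_locus"):
    """Write one FASTA file per catalog locus ID; return {locus_id: record_count}."""
    records = defaultdict(list)  # locus ID -> header and sequence lines, in input order
    counts = defaultdict(int)    # locus ID -> number of records seen
    current = None               # locus ID of the record being read

    with open(fasta_path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                match = LOCUS_RE.match(line)
                if match is None:
                    raise ValueError(f"Header without a CLocus ID: {line}")
                current = match.group(1)
                counts[current] += 1
            elif current is None:
                raise ValueError("Sequence line found before any header")
            # Sequence lines (one or several) stay with the most recent header.
            records[current].append(line)

    os.makedirs(out_dir, exist_ok=True)
    for locus, lines in records.items():
        out_path = os.path.join(out_dir, f"CLocus_{locus}.fasta")
        with open(out_path, "w") as out:
            out.write("\n".join(lines) + "\n")
    return dict(counts)
```

Calling `split_fasta_by_locus("alleles.fasta")` (a placeholder file name) would create files such as `by_locus/CLocus_2857.fasta` holding every allele for that locus, and the returned dictionary gives the number of records per locus as a quick sanity check (16 per locus if every individual has both alleles).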