Separate fasta file for each locus
1
0
Entering edit mode
5.5 years ago
NM • 0

Greetings all,

I'm currently working on a molecular ecological study and have a fasta file containing both alleles for 50 loci from 8 individuals. Some of the programs I want to use require a separate fasta file for each locus, and I'm trying to find a way to group the data into files based on their catalog locus ID.

My data currently looks like this:

    >CLocus_2857_Sample_8_Locus_15367_Allele_0 [Test_1]  
AATTCGCGGTGGGGCTCTACAGGCAGCAGAATCCCTTCAGCACCCAGCCCAGGGCTGCCCTGGAGAAGGTCTGGATGTGCAGTGAATGAGATGGGGCCACAAGAAATGTGAGCTGAAGTCACGGGATGGATCCTCAGGCTGC
    >CLocus_2857_Sample_8_Locus_15367_Allele_1 [Test_1]  
AATTCGCGGTGGGGCTCTACAGGCAGCAGAATCCCTTCAGCACCCAGCCCAGGGCTGCCCTGGAGAAGGTCTGGATGTGCAGTGAATGAGATGGGGCCACAAGAAATGTGAGCTGAAGTCACGGGATGGATCCTCAGGCTGC
    >CLocus_2886_Sample_0_Locus_62236_Allele_0 [Test_2]  
AATTCAGTGTGGTGGTCTTCCTGGACTGGGTCACGGCCTTTTTTTGTGGGATGCACGTGTGCTTTGTGTGTTTGTGTGTGACCAAAAGCTAAATTAATTGGAAAATGAGTCTGTACTGTTTTGCAAATATGTTAAATGATGT
    >CLocus_2886_Sample_0_Locus_62236_Allele_1 [Test_2]  
AATTCAGTGTGGTGGTCTTCCTGGACTGGGTCACGGCCTTTTTTTGTGGGATGCACGTGTGCTTTGTGTGTTTGTGTGTGACCAAAAGCTAAATTAATTGGAAAATGAGTCTGTACTGTTTTGCAAATATGTTAAATGATGT

Where the number 'CLocus_X' is what I want to group them by.

I think the answer lies with using the awk tool, but my attempts so far have been clumsy and unsuccessful - I'm new to scripting, and stumped. I've found similar answers on here, but nothing that handles .fasta files with the same syntax.

Any advice would be hugely appreciated! Thank you all for your time.

RADseq fasta awk • 1.6k views
ADD COMMENT
0
Entering edit mode

it doesn't look like 'fasta', there is only one line per record including name AND sequence (?)

ADD REPLY
0
Entering edit mode

Apologies, there should be two lines per record, one for the name and one for the sequence - it looked normal when I pasted it into the box. I'll edit it to properly reflect the format.

ADD REPLY
2
Entering edit mode
5.5 years ago
$  grep -Po '(?<=^>)[A-Za-z]*_[0-9]*(?=.*)' test.fa |  uniq |  parallel 'grep -A 1 {} test.fa > {}.fa'

output:

$ cat CLocus_2886.fa 
>CLocus_2886_Sample_0_Locus_62236_Allele_0 [Test_2] 
AATTCAGTGTGGTGGTCTTCCTGGACTGGGTCACGGCCTTTTTTTGTGGGATGCACGTGTGCTTTGTGTGTTTGTGTGTGACCAAAAGCTAAATTAATTGGAAAATGAGTCTGTACTGTTTTGCAAATATGTTAAATGATGT
>CLocus_2886_Sample_0_Locus_62236_Allele_1 [Test_2] 
AATTCAGTGTGGTGGTCTTCCTGGACTGGGTCACGGCCTTTTTTTGTGGGATGCACGTGTGCTTTGTGTGTTTGTGTGTGACCAAAAGCTAAATTAATTGGAAAATGAGTCTGTACTGTTTTGCAAATATGTTAAATGATGT

$ cat CLocus_2857.fa 
>CLocus_2857_Sample_8_Locus_15367_Allele_0 [Test_1] 
AATTCGCGGTGGGGCTCTACAGGCAGCAGAATCCCTTCAGCACCCAGCCCAGGGCTGCCCTGGAGAAGGTCTGGATGTGCAGTGAATGAGATGGGGCCACAAGAAATGTGAGCTGAAGTCACGGGATGGATCCTCAGGCTGC
>CLocus_2857_Sample_8_Locus_15367_Allele_1 [Test_1] 
AATTCGCGGTGGGGCTCTACAGGCAGCAGAATCCCTTCAGCACCCAGCCCAGGGCTGCCCTGGAGAAGGTCTGGATGTGCAGTGAATGAGATGGGGCCACAAGAAATGTGAGCTGAAGTCACGGGATGGATCCTCAGGCTGC

input:

$ cat test.fa
>CLocus_2857_Sample_8_Locus_15367_Allele_0 [Test_1] 
AATTCGCGGTGGGGCTCTACAGGCAGCAGAATCCCTTCAGCACCCAGCCCAGGGCTGCCCTGGAGAAGGTCTGGATGTGCAGTGAATGAGATGGGGCCACAAGAAATGTGAGCTGAAGTCACGGGATGGATCCTCAGGCTGC
>CLocus_2857_Sample_8_Locus_15367_Allele_1 [Test_1] 
AATTCGCGGTGGGGCTCTACAGGCAGCAGAATCCCTTCAGCACCCAGCCCAGGGCTGCCCTGGAGAAGGTCTGGATGTGCAGTGAATGAGATGGGGCCACAAGAAATGTGAGCTGAAGTCACGGGATGGATCCTCAGGCTGC
>CLocus_2886_Sample_0_Locus_62236_Allele_0 [Test_2] 
AATTCAGTGTGGTGGTCTTCCTGGACTGGGTCACGGCCTTTTTTTGTGGGATGCACGTGTGCTTTGTGTGTTTGTGTGTGACCAAAAGCTAAATTAATTGGAAAATGAGTCTGTACTGTTTTGCAAATATGTTAAATGATGT
>CLocus_2886_Sample_0_Locus_62236_Allele_1 [Test_2] 
AATTCAGTGTGGTGGTCTTCCTGGACTGGGTCACGGCCTTTTTTTGTGGGATGCACGTGTGCTTTGTGTGTTTGTGTGTGACCAAAAGCTAAATTAATTGGAAAATGAGTCTGTACTGTTTTGCAAATATGTTAAATGATGT
ADD COMMENT
1
Entering edit mode

It’s probably worth mentioning that this solution requires linearised sequences (though I also assume thats what OP has).

ADD REPLY
0
Entering edit mode

EDIT: The command works now - just had to load the parallel tool first. Thank you again!

Thank you for the quick response!

I can get the first half of the command to work fine - it outputs a list of the CLocus ID numbers. However, I can't get 'parallel' to function on the computing cluster I'm working from.

I've tried using xargs as a substitute but when I enter the command as follows:

grep -Po '(?<=^>)[A-Za-z]*_[0-9]*(?=.*)' ./batch_1.samples.fa | uniq | xargs grep -A 1 {} batch_1.samples.fa > {}.fa

I get a list of errors like this:

grep: CLocus_2857: No such file or directory grep: CLocus_2886: No such file or directory etc.

From what I can tell, it's trying to find files named after the CLocus IDs instead of searching -for- the CLocus IDs in the samples.fa file. Keeping the single quotes around the entire second command seems to treat the entire string as a filename/directory.

What should I be changing? Thank you again for the help.

ADD REPLY
0
Entering edit mode

if you are on ubuntu/mint/debian:

sudo apt install parallel -y

This should install gnu-parallel.

ADD REPLY

Login before adding your answer.

Traffic: 2436 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6