I have a set of sequence files, where individual sequences look like this:
GACTACACGTAGTATATCTAGCGACTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTATCCCTACCTGATCCGAGGTCAACCTGAAAAATGGGGGTTCGTGCAGGTGCCGCCCGGGGTCGTGTAGCGAGGAGTATTACTACGCTTAGAGCCCGGCCGTACCGCCACTGCTTTTGTAGGCCCGCCAACCGGCGGTGCCCAACGACCCAGCGAGCTGGATTGGTTATAATGACGCTCGAACAGGCATGCCCCTCGGAATACCAAGGGGCGCAATGTGCGTTCAAAGATTCGATGATTCACTGAATTCTGCAATTCACATTACTTATCGCATTTCGCTACGTTCTTCATCGATGACGAGATACTAGGTGTGGTCGGCGTCTCTCAAGGCACACAGGGGATAGG
There is a primer sequence that has a few variable positions: TCCTSCGCTTATTGATATGC
However, this particular sequence has 26bp before the primer starts, that includes both an adaptor sequence and a barcode. In this case that sequence is: GACTACACGTAGTATATCTAGCGACT
So, the primer is always 20bp, but the leading adaptor and barcode can vary by 2bp in length, hence I can't just trim 46bp off the front. Is there a good way to handle this? Unfortunately I don't have a file of adaptor and barcode sequences, just this information. It may be relevant that this is old 454 data.
Is the S in the primer a typo?
What is your aim with this data? If you are just going to align it, then (depending on the aligner) the primer and tag can get soft-clipped.
It's intended to represent a variable-position base (either a G or a C). These are old sequences (~2012), and I realize naming conventions for variable positions may have changed.
I want to remove the primer sequence and everything that comes before it. I do not plan to align these sequences; they are going to be compared to a reference database (UNITE) via RDP. They are amplicon fungal ITS2 sequences from soil.
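For what it's worth, here is a minimal Python sketch of one way to do that kind of trimming: turn the degenerate primer into a regular expression, look for it near the start of each read, and keep only what follows the match. The function names and the `max_start` cutoff are my own assumptions, not something from the thread, and an exact regex match will miss any read with a sequencing error inside the primer, so treat it as a starting point rather than a finished tool.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a degenerate primer (e.g. containing S) into a regex."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def trim_through_primer(seq, primer, max_start=40):
    """Remove the primer and everything before it.

    max_start (an assumed cutoff) is the furthest position at which the
    primer may begin, so a chance match deep inside the amplicon is ignored.
    Returns None when no exact primer match is found in that window.
    """
    pattern = primer_regex(primer)
    match = pattern.search(seq.upper(), 0, max_start + len(primer))
    if match is None:
        return None
    return seq[match.end():]


if __name__ == "__main__":
    # First 60 bases of the example read from the question
    read = "GACTACACGTAGTATATCTAGCGACTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGG"
    print(trim_through_primer(read, "TCCTSCGCTTATTGATATGC"))
```

Reads that come back as `None` are probably worth setting aside and inspecting rather than silently dropping, since they may just carry an error in the primer region.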
A second follow-up: after running this command, my sequence file looks like this:
The sequence is actually broken across multiple lines, rather than being written on a single line. This can be verified with wc -l filename.fna. This is going to be a problem downstream. Any ideas on how to fix this?
It is OK to have FASTA files with the sequence wrapped across lines like that. What program are you planning to use for downstream analysis? If you do want to make them single-line FASTA, then use the solution in this thread: Multiline Fasta To Single Line Fasta
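In case a Python version is handy, the unwrapping can also be sketched in a few lines. This is not the solution from the linked thread, just an illustration of the same idea, and the `unwrap_fasta` name is made up here: header lines pass through unchanged, and the sequence lines of each record are joined into one.

```python
def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line."""
    seq_parts = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # New record: flush the sequence collected for the previous one
            if seq_parts:
                yield "".join(seq_parts)
                seq_parts = []
            yield line
        elif line:
            # Sequence line (blank lines are skipped)
            seq_parts.append(line)
    # Flush the final record
    if seq_parts:
        yield "".join(seq_parts)


# Example use on an in-memory record; for a real file, pass the open
# file handle instead of a list and write each yielded line out.
wrapped = [">read1\n", "ACGTACGT\n", "TTGA\n"]
for out_line in unwrap_fasta(wrapped):
    print(out_line)
```

After this, every record is exactly two lines (header, then sequence), which is the layout that line-based tools such as sed expect.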
Gotcha. I do some manipulations using sed downstream which require the sequence to be on a single line. Thanks for this link.