Question

Removing primer and a variable length adaptor/barcode sequence from .fna file

0

Entering edit mode

5.7 years ago

caverill ▴ 40

I have a set of sequence files, where individual sequences look like this:

GACTACACGTAGTATATCTAGCGACTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTATCCCTACCTGATCCGAGGTCAACCTGAAAAATGGGGGTTCGTGCAGGTGCCGCCCGGGGTCGTGTAGCGAGGAGTATTACTACGCTTAGAGCCCGGCCGTACCGCCACTGCTTTTGTAGGCCCGCCAACCGGCGGTGCCCAACGACCCAGCGAGCTGGATTGGTTATAATGACGCTCGAACAGGCATGCCCCTCGGAATACCAAGGGGCGCAATGTGCGTTCAAAGATTCGATGATTCACTGAATTCTGCAATTCACATTACTTATCGCATTTCGCTACGTTCTTCATCGATGACGAGATACTAGGTGTGGTCGGCGTCTCTCAAGGCACACAGGGGATAGG

There is a primer sequence that has a few variable positions: TCCTSCGCTTATTGATATGC

However, this particular sequence has 26bp before the primer starts, that includes both an adaptor sequence and a barcode. In this case that sequence is: GACTACACGTAGTATATCTAGCGACT

So, the primer is always 20bp, but the leading adaptor and barcode can vary by 2bp in length, hence I can't just trim 46bp off the front. Is there a good way to handle this? Unforunately I dont have a file of adaptor and barcode sequences, just this information. It may be relevant that this is old 454 data.

primer • 2.5k views

ADD COMMENT • link 5.7 years ago by caverill ▴ 40

0

Entering edit mode

TCCTSCGCTTATTGATATGC

Is that a typo?

What is your aim with this data? If you are just going to align it (depending on the aligner) the primer and tag can get softclipped.

ADD REPLY • link 5.7 years ago by WouterDeCoster 47k

1

Entering edit mode

its intended to represent a variable position base (either a G or a C), these are old sequences (~2012), and I realize naming conventions for variable positions may have changed.

I want to remove the primer sequence and everything that comes before it. I do not plan to align these sequences, they are going to be compared to a reference database (UNITE) via RDP. They are amplicon fungal ITS2 sequences from soil.

ADD REPLY • link 5.7 years ago by caverill ▴ 40

0

Entering edit mode

A second follow up: After running this command my sequence file looks like this:

>SRR1502564_598 SRR1502564.4200 ID1YYUI01ENB5O length=496 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
TTAAGTTCAGCGGGTAGTCTTACTTGATTTGAGATCGAGTTTTACAAGAGTCGTTTCCAACCCCTGTGAA
ATCCTGCATCAGTCAGCCAATACGGTCAGACTCCCTTTATGTTAGCTGCAGCAAAAGTAATAATCCGTTT
GACGGGACTAAATAGATATGCTTATAGCTCAGGAGAATGTCCAGCTGCACCTGCATTTCAAGCAACCCTC
CGCCGATCGTAAGCGACTGGTGTTGGGATTGCTCAAGTCCAAAGTCATTCCAGAAATAAATCAGAAATGT
CTTTGAGGTGTTTACTGATACTCAAACAAGCATGCTCTTCGGAATACCAAAGAGCGCAATATGCGTTCAA
AGATTCGATGATTCACTGAATTCTGCAATTCACATTACGTATCGCATTTCGCAGCGTTCTTCATCGATGA
CGAGTCTAGATACTAGGTGTGGTCGGCGTC

The sequence is actually broken up onto multiple lines, rather being written to a single line. This can be verified with wc -l filename.fna. This is going to be a problem downstream. Any ideas on how to fix this?

ADD REPLY • link updated 5.7 years ago by GenoMax 141k • written 5.7 years ago by caverill ▴ 40

1

Entering edit mode

It is ok to have fasta files with sequence wrapping around like it. What program are you planning to use for downstream analysis? If you do want to make them single line fasta then use the solution in this thread: Multiline Fasta To Single Line Fasta

ADD REPLY • link 5.7 years ago by GenoMax 141k

0

Entering edit mode

Gotcha. I do some manipulations using sed downstram which require the sequence line to be a single line. Thanks for this link.

ADD REPLY • link 5.7 years ago by caverill ▴ 40

score 2 · Answer 1 · 2018-08-16

2

Entering edit mode

5.7 years ago

GenoMax 141k

Use bbduk.sh from BBMap suite with literal=TCCTGCGCTTATTGATATGC,TCCTCCGCTTATTGATATGC ktrim=l k=10 (and any additional options you need). This will remove the primer and all the bases to the left. Here is a detailed guide for the program.

ADD COMMENT • link 5.7 years ago by GenoMax 141k

0

Entering edit mode

This works so well! Thank you! I see that ktrim=l tells bbduk to trim everything to the left. Can you elaborate on what k=10 indicates?

ADD REPLY • link 5.7 years ago by caverill ▴ 40

1

Entering edit mode

That is k-mer size used for the initial match of the literal sequence (which is then extended outwards).

ADD REPLY • link 5.7 years ago by GenoMax 141k

0

Entering edit mode

Following up on this solution: what happens in the kmer is not found? Is it just passed to the output file? I am running this again to filter a sequence file that contains reads with multiple different primers. I would like to subset to a particular primer set of interest. To do this, i figured I could just use bbduk to trim the primer of interest, and discard reads that do not match the primer. Is there a simple way to do this with bbduk?

ADD REPLY • link 5.3 years ago by caverill ▴ 40