Question

how to identify degenerate primers in reads

0

Entering edit mode

5.1 years ago

genya35 ▴ 40

Hello,

I have a two adapter trimmed Illumina fastq files (read1 and read2). I would like to identify the primer sequence. I have some idea what the primer sequences are however, they appears to be degenerate. So far I've found over 200 variations for the primer sequence in reads and looking for a tool that would group them and produce degenerate primer sequence.

Next, I plan to separate the reads into separate files based on the primer sequence using bbtools.

Please suggest an approach/tool to use.

next-gen • 1.7k views

ADD COMMENT • link updated 4.8 years ago by Biostar 20 • written 5.1 years ago by genya35 ▴ 40

0

Entering edit mode

Can you give some details on what you mean by "primer sequence". What kind of assay are you using?

ADD REPLY • link 5.1 years ago by ATpoint 81k

0

Entering edit mode

it's an amplicon based assay. By primer sequence I mean "NNCTGGGTCCGCCAAGCT". Thanks

ADD REPLY • link 5.1 years ago by genya35 ▴ 40

0

Entering edit mode

I have a bunch of sequences that I'm trying to figure out at which base they are ambiguous. I wonder if there is a tool that can help with that.

AAGGATCCGGCAGCCCCC
ACGGATCCGGCAGCCCCC
AGGGATCCGGCAGCCCCC
ATGGATCCGGCAGCCCCC
CAGGATCCGGCAGCCCCC
CGGGATCCGGCAGCCCCC
GAGGATCCGGCAGCCCCC
GCGGATCCGGCAGCCCCC

ADD REPLY • link 5.1 years ago by genya35 ▴ 40

1

Entering edit mode

@Brian had provided this solution in past. Change the number 20 depending on size of primer you expect.

Following will trim all but first 20 bases (I assume your primer is at the beginning of read)?

$ reformat.sh in=reads.fq out=trimmed.fq ftr=19

This will generate a file containing the counts of all 20-mers that occurred at least 10 times, in a 2-column format that is easy to sort (in excel if you wish)

$ kmercountexact.sh in=trimmed.fq out=counts.txt fastadump=f mincount=10 k=20 rcomp=f

If primers are 20 bp long they should be easily identifiable.

ACCGTTACCGTTACCGTTAC    100
AAATTTTTTTCCCCCCCCCC    85

If this works let me know and I will move it to an answer.

ADD REPLY • link 5.1 years ago by GenoMax 141k

0

Entering edit mode

This was really helpful, however, I ended up with 1560 unique sequences. Is there a tool that would group them based on the similarity?

ADD REPLY • link 5.1 years ago by genya35 ▴ 40

0

Entering edit mode

Use unix sort to sort the first column containing the sequence. That said you did not get a few clear contenders (with lots of counts) as compared to rest? Are you explaining what you need accurately?

ADD REPLY • link 5.1 years ago by GenoMax 141k

0

Entering edit mode

@genomax There is too much variation in the first 3 bases of each read. Is there a way to tell reformat.sh to ignore the first three bases and look at the base 4 to 23? thanks

ADD REPLY • link 5.1 years ago by genya35 ▴ 40

0

Entering edit mode

You can add ftl=2 to remove first three bases and extend ftr=22 to still get 20 bases total in first reformat.sh command above. Then follow the same process.

Just in case both those options don't work right together you could do something like (two consecutive rounds of trimming)

reformat.sh in=reads.fq out=stdout.fq ftr=22 | reformat.sh in=stdin.fq out=trimmed.fq ftl=2

ADD REPLY • link 5.1 years ago by GenoMax 141k