I want to analyse data from a CRISPR/Cas9 screen (control vs. treatment) and I'm using Mageck (https://sourceforge.net/projects/mageck/). Mageck calculates the trim length of reads automatically but in case the trim is variable the program recommends the user to use cutadapt.
I used cutadapt to remove the 5' sequence in front of the gRNA (20 nt) of interest (there are 12 different guides) but some reads with the complete gRNA are not trimmed because there are big deletions. I tried to increase the maximum error rate of cutadapt but the Mageck gRNA count obtained after the trimming is sligthly smaller than the obtained searching each gRNA with grep.
Is it possible to improve the trimming in order to only obtain the gRNAs sequences (the 3' is not necessary to be eliminated)?
This topic is similar to my problem: https://github.com/marcelm/cutadapt/issues/261
If I am understanding this right you may be able to use
bbduk.sh
from BBMap suite. Find thegRNA
sequence usingliteral=gRNA_sequence ktrim=l
. This will find the gRNA and then remove everything 5' to that hit.The problem is that I have to analyse data generated by 60,000 gRNAs (all of them have to be searched in each fastq). Is it possible to do it with that number of gRNAs?
Should be possible. Put the sequences as multi-fasta in a file and then feed the file to
bbduk.sh
withref=gRNA_file.fa
. Bump the memory you assign by setting a higher number with-Xms20g
(for example).This code works (it finds the sequences) but it removes the gRNA sequence... I would be grateful if you could answer again.
Take a look at BBduk guide here to see if you can make it work for your use case. Sounds like you want to keep the
gDNA
sequence so this is not quite the intended use for bbduk in trim mode.Yes. that's the problem but I took a look to the guide and I didn't find any option to do what I want to do.