I've received my sequences from Illumina NextSeq in fastq files (R1-15bp R2-77bp), I'm trimming off the end of R2 where Qscore is low, but I also suspect the low mapping-% could be contributed by all the N's in my sequences.
I'd like to get rid of sequences with excessive N's before mapping.
I can't get the right command to do this.
grep -v "$(grep -A 2 -B 1 "NN" in.fastq)" in.fastq
sort describes what I want to do, but undesirably gets rid of the +'s for me. Any ideas?
As R1 & R2 are paired reads coming as separate fastq files, if I remove the sequences with N's in R2, how does this affect the demultiplexing and bowtie mapping step? Should I be removing corresponding sequences in R1 too? Unfortunately I don't understand the py script (CEL-Seq2 by Yanai Lab) well enough to determine this.