I've received my sequences from an Illumina NextSeq as FASTQ files (R1: 15 bp, R2: 77 bp). I'm trimming off the end of R2 where the Q-score is low, but I also suspect that the many N's in my sequences contribute to the low mapping percentage.
I'd like to get rid of sequences with excessive N's before mapping.
Example input:
@NS500799:417:HHMWHBGX7:1:11101:10296:1148:UMI:GAGACT:
GCCCTGCCTTCAAATGAAANNAGTTCAGCATGCCC
+
AAAAAEEEEEE66<EE/66EE<EEA6/A//EAAEE
@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/
@NS500799:417:HHMWHBGX7:1:11101:22941:1238:UMI:CCGTTT:
GCGCAGAGTTTAAACGCGAATNNGCTCAACTGGTT
+
AA/AA//EEEEEEEEEEEEEEEEEEEAE/EEEEEE

Desired output:
@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/
Q1: I can't get the right command to do this. Something like
grep -v "$(grep -A 2 -B 1 "NN" in.fastq)" in.fastq
sort of describes what I want to do, but it also removes all the + lines, which I don't want. Any ideas?
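One way to keep each four-line record together is to read the file record by record rather than line by line, which sidesteps the problem of `grep -v` matching every `+` line. Below is a minimal Python sketch, not a tested pipeline step. It assumes each record is exactly four lines (no wrapped sequences), the file is uncompressed with no trailing blank lines, and "excessive" means more than `max_n` N's. The function names and the `max_n` threshold are made up for illustration.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return  # end of file
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def filter_n(in_path, out_path, max_n=1):
    """Write records with at most max_n N's; return (kept, dropped)."""
    kept = dropped = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for rec in read_fastq(fin):
            # rec[1] is the sequence line of the record
            if rec[1].upper().count("N") <= max_n:
                fout.write("\n".join(rec) + "\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Called as `filter_n("in.fastq", "filtered.fastq", max_n=1)`, this would drop the two example reads containing `NN` and keep the middle one, with its `+` and quality lines intact.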
Q2: As R1 and R2 are paired reads stored in separate FASTQ files, if I remove the sequences with N's from R2, how does this affect the demultiplexing and bowtie mapping steps? Should I be removing the corresponding sequences in R1 too? Unfortunately, I don't understand the Python script (CEL-Seq2 by Yanai Lab) well enough to determine this.
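For context on Q2: paired FASTQ files are conventionally matched by position, so tools that read R1 and R2 together generally expect the n-th record in one file to belong with the n-th record in the other. Whether the CEL-Seq2 script relies on this has not been checked here, so the following is only a sketch under that assumption. It walks both files in lockstep and drops a pair from both outputs whenever the R2 read has too many N's. The function names, paths and `max_n` threshold are hypothetical.

```python
from itertools import islice, zip_longest


def records(handle):
    """Yield 4-line FASTQ records as lists of lines (newlines kept)."""
    while True:
        rec = list(islice(handle, 4))
        if not rec:
            return  # end of file
        if len(rec) != 4:
            raise ValueError("truncated FASTQ record")
        yield rec


def filter_pairs(r1_in, r2_in, r1_out, r2_out, max_n=1):
    """Keep a pair only if the R2 sequence has at most max_n N's."""
    kept = 0
    with open(r1_in) as f1, open(r2_in) as f2, \
            open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        for rec1, rec2 in zip_longest(records(f1), records(f2)):
            if rec1 is None or rec2 is None:
                raise ValueError("R1 and R2 have different record counts")
            # Decide on R2 only, but write or skip both mates together
            if rec2[1].upper().count("N") <= max_n:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept
```

Filtering both files in one pass like this keeps the two outputs the same length and in the same order, which is what position-based pairing needs.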