Biostar Beta. Not for public use.
Question: Removing N's from NextSeq reads
0
Entering edit mode

Hi,

I've received my sequences from Illumina NextSeq in fastq files (R1-15bp R2-77bp), I'm trimming off the end of R2 where Qscore is low, but I also suspect the low mapping-% could be contributed by all the N's in my sequences.
I'd like to get rid of sequences with excessive N's before mapping.

Sample fastq:

@NS500799:417:HHMWHBGX7:1:11101:10296:1148:UMI:GAGACT:
GCCCTGCCTTCAAATGAAANNAGTTCAGCATGCCC
+
AAAAAEEEEEE66<EE/66EE<EEA6/A//EAAEE
@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/
@NS500799:417:HHMWHBGX7:1:11101:22941:1238:UMI:CCGTTT:
GCGCAGAGTTTAAACGCGAATNNGCTCAACTGGTT
+
AA/AA//EEEEEEEEEEEEEEEEEEEAE/EEEEEE

Desired output:

@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/

Q1: I can't get the right command to do this. Something like

grep -v "$(grep -A 2 -B 1 "NN" in.fastq)" in.fastq

sort describes what I want to do, but undesirably gets rid of the +'s for me. Any ideas?

Q2: As R1 & R2 are paired reads coming as separate fastq files, if I remove the sequences with N's in R2, how does this affect the demultiplexing and bowtie mapping step? Should I be removing corresponding sequences in R1 too? Unfortunately I don't understand the py script (CEL-Seq2 by Yanai Lab) well enough to determine this.

Thank you!!

ADD COMMENTlink 19 months ago exin • 50 • updated 19 months ago cpad0112 11k
Entering edit mode
3

This is not how you should removing the N's. By deleting cycles in the middle of the read you are going to screw up the reading frame. If you want to remove N's you should then trim the read all the way to the 3'-end. If all of your reads have N's in that specific position then it is possible that the sequencer experienced a hardware/software issue during the run. Illumina generally replaces reagents for runs that fall in this category (as long as you have a maintenance contract) so your sequence provider should run this sample at no charge. Edit: I see that this is a scRNA run. So I would be inclined to think that this run was likely overloaded and did not have enough phiX spike-in.

ADD REPLYlink 19 months ago
genomax
68k
Entering edit mode
0

I'm not sure if I'm missing something, can you explain how what I wanted to do would screw up the reading frame? I'm not removing just the Ns, but discarding entire reads from the mapping step, as long as they contain consecutive N's. These N's are still there after trimming.

Thanks for pointing out the likely cause! Speaking to the sequencing facility to see if this was the issue...

ADD REPLYlink 19 months ago
exin
• 50
Entering edit mode
0

My apologies for not looking at your example carefully. I thought you were removing N's leaving the rest of the sequence intact since you posted a sequence titled desired output. I edited my comment above.

ADD REPLYlink 19 months ago
genomax
68k
4
Entering edit mode

You can download prinseq and can use prinseq-lite.pl script in it to remove sequences containing desired number of "Ns". You can use parameter command having following paramters:

prinseq-lite.pl -fastq test_R1.fastq -fastq2 test_R2.fastq -ns_max_n (number of 'N')

I've received my sequences from Illumina NextSeq in fastq files (R1-15bp R2-77bp),

But just wondering, how come your R1 file is having only 15bp? You are working on which application?

ADD COMMENTlink 19 months ago toralmanvar • 760
Entering edit mode
0

A shorter read 1 is normal in scRNAseq, since it's mostly UMI and cell barcodes followed by a stretch of AAAAAAAAAA...

ADD REPLYlink 19 months ago
Devon Ryan
90k
Entering edit mode
0

Ok, I have never worked on scRNAseq so didn't know this.

ADD REPLYlink 19 months ago
toralmanvar
• 760
Entering edit mode
0

Thank you!! Will look into prinseq.

Also still wondering if it's possible to achieve this with just bash commands? It seems do-able I'm just not proficient at it.

Yes I did CEL-Seq. R1 is for UMIs and barcodes only.

ADD REPLYlink 19 months ago
exin
• 50
2
Entering edit mode

? Should I be removing corresponding sequences in R1 too?

yes

Any ideas?

linearize both fastq R1 and R2, filter with awk, convert back to interleaved fastq

paste <(gunzip -c S1.R1.fq.gz | paste - - - -)  <(gunzip -c S1.R2.fq.gz | paste - - - -) | awk '!($2 ~ /N/ || $6 ~ /N/ )'  | tr "\t" "\n"

and if your software cannot handle interleaved fastq, see deinterleave fastq file

ADD COMMENTlink 19 months ago Pierre Lindenbaum 120k
Entering edit mode
0

Thanks Pierre! I'm a bioinformatics newbie, could I trouble you to expand your answer a bit more? How do I linearise R1 and R2 with awk? And how does the cleaned-R2's come into these steps?

ADD REPLYlink 19 months ago
exin
• 50
Entering edit mode
1

you don't linearize with awk, you linearize the 4 lines of each fastq with paste.

$ seq 1 8
1
2
3
4
5
6
7
8

$ seq 1 8 | paste - - - -
1   2   3   4
5   6   7   8

you do this with both R1/R2.

awk is used to test if column 2 or column 6 contains any 'N'

ADD REPLYlink 19 months ago
Pierre Lindenbaum
120k
Entering edit mode
0

Great. Thanks for the elaboration!!

ADD REPLYlink 19 months ago
exin
• 50
2
Entering edit mode
fastp -i test.fq -n 1 -o test.n.fq 

$ cat test.n.fq 
@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/

fastp is in brew, conda repos. n is number of minimum number of Ns tolerated. In this case 1. Any read with more than one N, will be removed.

Syntax for paired end (copy/pasted from fastp github) and try adapting above code for paired end:

fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz
ADD COMMENTlink 19 months ago cpad0112 11k
1
Entering edit mode

Alternate solution using seqkit

seqkit grep -srv -p 'N' sample.fastq
ADD COMMENTlink 19 months ago Vijay Lakhujani 4.1k
1
Entering edit mode

Q1 using only grep:

$ grep --no-group-separator  -B 1 "NN" test.fq | grep '\@' | grep  -v -f - <(grep '\@' test.fq) | grep -A 3 -f - test.fq
@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/
ADD COMMENTlink 19 months ago cpad0112 11k
Entering edit mode
0

This works! Thank you!

ADD REPLYlink 19 months ago
exin
• 50

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0