Biostar Beta. Not for public use.
Removing N's from NextSeq reads
0
Entering edit mode
2.5 years ago
exin • 50
@exin48044

Hi,

I've received my sequences from Illumina NextSeq in fastq files (R1-15bp R2-77bp), I'm trimming off the end of R2 where Qscore is low, but I also suspect the low mapping-% could be contributed by all the N's in my sequences.
I'd like to get rid of sequences with excessive N's before mapping.

Sample fastq:

@NS500799:417:HHMWHBGX7:1:11101:10296:1148:UMI:GAGACT:
GCCCTGCCTTCAAATGAAANNAGTTCAGCATGCCC
+
AAAAAEEEEEE66<EE/66EE<EEA6/A//EAAEE
@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/
@NS500799:417:HHMWHBGX7:1:11101:22941:1238:UMI:CCGTTT:
GCGCAGAGTTTAAACGCGAATNNGCTCAACTGGTT
+
AA/AA//EEEEEEEEEEEEEEEEEEEAE/EEEEEE

Desired output:

@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/

Q1: I can't get the right command to do this. Something like

grep -v "$(grep -A 2 -B 1 "NN" in.fastq)" in.fastq

sort describes what I want to do, but undesirably gets rid of the +'s for me. Any ideas?

Q2: As R1 & R2 are paired reads coming as separate fastq files, if I remove the sequences with N's in R2, how does this affect the demultiplexing and bowtie mapping step? Should I be removing corresponding sequences in R1 too? Unfortunately I don't understand the py script (CEL-Seq2 by Yanai Lab) well enough to determine this.

Thank you!!

RNA-Seq sequence cel-seq unix • 715 views
ADD COMMENTlink
3
Entering edit mode

This is not how you should removing the N's. By deleting cycles in the middle of the read you are going to screw up the reading frame. If you want to remove N's you should then trim the read all the way to the 3'-end. If all of your reads have N's in that specific position then it is possible that the sequencer experienced a hardware/software issue during the run. Illumina generally replaces reagents for runs that fall in this category (as long as you have a maintenance contract) so your sequence provider should run this sample at no charge. Edit: I see that this is a scRNA run. So I would be inclined to think that this run was likely overloaded and did not have enough phiX spike-in.

ADD REPLYlink
0
Entering edit mode

I'm not sure if I'm missing something, can you explain how what I wanted to do would screw up the reading frame? I'm not removing just the Ns, but discarding entire reads from the mapping step, as long as they contain consecutive N's. These N's are still there after trimming.

Thanks for pointing out the likely cause! Speaking to the sequencing facility to see if this was the issue...

ADD REPLYlink
0
Entering edit mode

My apologies for not looking at your example carefully. I thought you were removing N's leaving the rest of the sequence intact since you posted a sequence titled desired output. I edited my comment above.

ADD REPLYlink
4
Entering edit mode
2.5 years ago
toralmanvar • 760
@toralmanvar40472

You can download prinseq and can use prinseq-lite.pl script in it to remove sequences containing desired number of "Ns". You can use parameter command having following paramters:

prinseq-lite.pl -fastq test_R1.fastq -fastq2 test_R2.fastq -ns_max_n (number of 'N')

I've received my sequences from Illumina NextSeq in fastq files (R1-15bp R2-77bp),

But just wondering, how come your R1 file is having only 15bp? You are working on which application?

ADD COMMENTlink
0
Entering edit mode

A shorter read 1 is normal in scRNAseq, since it's mostly UMI and cell barcodes followed by a stretch of AAAAAAAAAA...

ADD REPLYlink
0
Entering edit mode

Ok, I have never worked on scRNAseq so didn't know this.

ADD REPLYlink
0
Entering edit mode

Thank you!! Will look into prinseq.

Also still wondering if it's possible to achieve this with just bash commands? It seems do-able I'm just not proficient at it.

Yes I did CEL-Seq. R1 is for UMIs and barcodes only.

ADD REPLYlink
2
Entering edit mode
2.5 years ago
@Pierre Lindenbaum30

? Should I be removing corresponding sequences in R1 too?

yes

Any ideas?

linearize both fastq R1 and R2, filter with awk, convert back to interleaved fastq

paste <(gunzip -c S1.R1.fq.gz | paste - - - -)  <(gunzip -c S1.R2.fq.gz | paste - - - -) | awk '!($2 ~ /N/ || $6 ~ /N/ )'  | tr "\t" "\n"

and if your software cannot handle interleaved fastq, see deinterleave fastq file

ADD COMMENTlink
0
Entering edit mode

Thanks Pierre! I'm a bioinformatics newbie, could I trouble you to expand your answer a bit more? How do I linearise R1 and R2 with awk? And how does the cleaned-R2's come into these steps?

ADD REPLYlink
1
Entering edit mode

you don't linearize with awk, you linearize the 4 lines of each fastq with paste.

$ seq 1 8
1
2
3
4
5
6
7
8

$ seq 1 8 | paste - - - -
1   2   3   4
5   6   7   8

you do this with both R1/R2.

awk is used to test if column 2 or column 6 contains any 'N'

ADD REPLYlink
0
Entering edit mode

Great. Thanks for the elaboration!!

ADD REPLYlink
2
Entering edit mode
2.5 years ago
@cpad011220673
fastp -i test.fq -n 1 -o test.n.fq 

$ cat test.n.fq 
@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/

fastp is in brew, conda repos. n is number of minimum number of Ns tolerated. In this case 1. Any read with more than one N, will be removed.

Syntax for paired end (copy/pasted from fastp github) and try adapting above code for paired end:

fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz
ADD COMMENTlink
1
Entering edit mode
2.5 years ago
@Vijay Lakhujani26377

Alternate solution using seqkit

seqkit grep -srv -p 'N' sample.fastq
ADD COMMENTlink
1
Entering edit mode
2.5 years ago
@cpad011220673

Q1 using only grep:

$ grep --no-group-separator  -B 1 "NN" test.fq | grep '\@' | grep  -v -f - <(grep '\@' test.fq) | grep -A 3 -f - test.fq
@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/
ADD COMMENTlink
0
Entering edit mode

This works! Thank you!

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.3