Removing Reads Which Have Adapter Sequences
11.0 years ago
Varun Gupta ★ 1.3k

Hi

I would like to remove from my FASTQ file all reads that contain an adapter sequence. Which tool, software, or Unix command-line option would be good for removing these reads?

PS: I don't want to trim the adapters; I want to remove the reads that contain the adapter sequence from the FASTQ file.

Seq of adapter: ACTAGTGTAGTCGTACTGATCT

Hope to hear from you soon

Regards

VARUN

11.0 years ago

Assuming all of your reads are the same length, you can use any existing read trimmer that has a minimum read length option (e.g. trim_galore with the --length option). Set the minimum to the original read length and the program will reject every trimmed read, since trimming makes it shorter than the initial read length. For example, if you have 100 bp reads, then running

trim_galore -a adapter --length 100 file.fastq

or something like that should do what you want. This also has the benefit of handling paired-end reads (presuming you want to filter out both reads of a pair).
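If the trimmer you use has no minimum-length option, the same idea can be applied as a separate step after trimming. This is only a minimal sketch, assuming single-end input, four-line FASTQ records (no wrapped sequences), and reads that all started at the same length; `keep_full_length_reads` is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: keep only FASTQ records whose sequence line still has
# the original read length, i.e. reads that the trimmer left untouched.
# Usage (file names are placeholders):
#   keep_full_length_reads 100 < trimmed.fastq > untrimmed_only.fastq
keep_full_length_reads() {
    read_length="$1"
    awk -v n="$read_length" '
        # Buffer the current 4-line record: header, sequence, "+", quality.
        { rec[NR % 4] = $0 }
        # On the 4th line, print the record only if the sequence is full length.
        NR % 4 == 0 && length(rec[2]) == n {
            print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
        }
    '
}
```

Note that this filters each file independently, so for paired-end data the two files would no longer stay in sync; a pair-aware trimmer is the safer choice there.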

Hi, this would also trim reads that don't have the adapter sequence but have poor quality at the ends. I don't want to trim those reads. How should I go about it?

Varun

Only if you want it to. You can set whatever quality-trimming threshold you want. Try it with -q 0.

11.0 years ago
bioinfo ▴ 830

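Have you tried the FASTX-Toolkit? It includes fastx_clipper, which is normally used to clip adapter sequences. With the -C option it instead discards the clipped sequences, keeping only the reads that did not contain the adapter, which is what you are asking for. Here is its help output: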

$ fastx_clipper -h
usage: fastx_clipper [-h] [-a ADAPTER] [-D] [-l N] [-n] [-d N] [-c] [-C] [-o] [-v] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-a ADAPTER] = ADAPTER string. default is CCTTAAGG (dummy adapter).
   [-l N]       = discard sequences shorter than N nucleotides. default is 5.
   [-d N]       = Keep the adapter and N bases after it.
          (using '-d 0' is the same as not using '-d' at all. which is the default).
   [-c]         = Discard non-clipped sequences (i.e. - keep only sequences which contained the adapter).
   [-C]         = Discard clipped sequences (i.e. - keep only sequences which did not contained the adapter).
   [-k]         = Report Adapter-Only sequences.
   [-n]         = keep sequences with unknown (N) nucleotides. default is to discard such sequences.
   [-v]         = Verbose - report number of sequences.
          If [-o] is specified,  report will be printed to STDOUT.
          If [-o] is not specified (and output goes to STDOUT),
          report will be printed to STDERR.
   [-z]         = Compress output with GZIP.
   [-D]        = DEBUG output.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
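
For a dependency-free alternative that works in the same spirit as -C, the reads can also be filtered with plain awk. This is only a rough sketch under stated assumptions: four-line FASTQ records, single-end input, and an exact, full-length match of the adapter in the forward orientation. It will miss adapters with sequencing errors or adapters cut off at the 3' end of a read, which a real clipper handles. `remove_adapter_reads` is a hypothetical helper name.

```shell
# Hypothetical helper: drop every 4-line FASTQ record whose sequence line
# contains the adapter as an exact substring; print all other records unchanged.
# Usage (file names are placeholders):
#   remove_adapter_reads ACTAGTGTAGTCGTACTGATCT < file.fastq > filtered.fastq
remove_adapter_reads() {
    adapter="$1"
    awk -v adapter="$adapter" '
        # Buffer the current 4-line record: header, sequence, "+", quality.
        { rec[NR % 4] = $0 }
        # On the 4th line, print the record unless the sequence holds the adapter.
        NR % 4 == 0 {
            if (index(rec[2], adapter) == 0)
                print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
        }
    '
}
```

Only the sequence line is searched, so a header or quality string that happens to contain the same characters does not cause a read to be dropped.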