Question

Unable to find adapter seq using Fastp

0

5.5 years ago

divyojsingh • 0

Hi!

We are running a Ribo-seq analysis on some FASTQ files and using fastp for adapter detection and removal, but we have run into a problem. fastp removes the adapter only when the adapter sequence is passed on the command line with -a. Without it, fastp does not detect the adapter on its own and reports that no adapter was detected. The SRA run, adapter sequence, and commands are as follows:

SRA file : SRR810103 (fastq size 31.6GB)

adapter seq: TCGTATGCCGTCTTCTGCTTG

./fastp -i ./SRR810103.fastq -a TCGTATGCCGTCTTCTGCTTG -q 33 -w 10 -o ./output.fastq

./fastp -i ./SRR810103.fastq -w 10 -o ./output.fastq

Can you please help us?

Eagerly waiting for your help!

sequencing sequence assembly software error fastp • 5.4k views

ADD COMMENT • link updated 3.0 years ago by Dunois ★ 2.5k • written 5.5 years ago by divyojsingh • 0

2

Have you considered the possibility that the data is already trimmed? This is especially likely if the reads are not all of equal length.
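A quick way to check whether the reads are all the same length is to tabulate the read lengths. The following is a minimal sketch, assuming a plain FASTQ file with four lines per record; the function name `read_length_counts` and the tiny demo file are made up for illustration.

```shell
# Tabulate read lengths in a FASTQ file.
# Assumption: 4 lines per record, sequences not wrapped across lines.
read_length_counts() {
    awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' "$1" | sort -n
}

# Tiny made-up FASTQ file (hypothetical reads, for illustration only)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n@r3\nACGTACGT\n+\nIIIIIIII\n' > demo_lengths.fastq

# Prints one "length count" pair per line, shortest first
read_length_counts demo_lengths.fastq
```

A single length in the output suggests untrimmed data; a spread of lengths suggests that trimming has already happened.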

As an alternative, you could try bbduk.sh from the BBMap suite with the literal=your_adapter_seq option. A guide for BBduk is available here.

ADD REPLY • link 5.5 years ago by GenoMax 141k

0

Thanks for your suggestion. All the reads in the file are 51 nucleotides long, so I do not think the adapters have already been trimmed (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR810103). Also, I want to rely on fastp's automatic adapter detection so that the workflow generalizes to any Ribo-seq data set: it should detect and remove whatever adapters are present without my supplying them.

ADD REPLY • link 5.5 years ago by divyojsingh • 0

0

You can try trim_galore; it should do what you need.

ADD REPLY • link 5.3 years ago by 787685856 • 0

0

Hi,

I'm facing the same problem. The paper didn't provide the adapter sequence. Is there an effective tool that detects adapters automatically? Could you help me solve this problem?

Thank you very much!

ADD REPLY • link 3.0 years ago by Hongyan • 0

0

If you can find out which kit was used to prepare the libraries, you can simply use the adapter for that kit. Your data may not contain adapter sequences at all, so unless you are planning de novo assemblies, you could let the aligner handle any adapter sequence by soft-clipping at alignment time.

ADD REPLY • link 3.0 years ago by GenoMax 141k

0

Run fastqc to see whether and which adapter there is, then google for the sequence and provide it to the tool.

https://emea.support.illumina.com/bulletins/2016/12/what-sequences-do-i-use-for-adapter-trimming.html
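Once a candidate adapter has been found, it can be worth checking how many reads actually contain it before handing it to a trimmer. The following is a rough sketch, assuming a plain four-line FASTQ file; the function name `adapter_fraction` is made up for illustration, and matching is exact, so reads carrying the adapter with sequencing errors or only a truncated piece of it are not counted.

```shell
# Report how many reads contain a candidate adapter (or its first few bases).
# Assumption: 4 lines per record; exact substring matching only.
adapter_fraction() {
    awk -v ad="$2" '
        NR % 4 == 2 { total++; if (index($0, ad)) hits++ }
        END {
            pct = (total ? 100 * hits / total : 0)
            printf "%d of %d reads (%.1f%%) contain %s\n", hits, total, pct, ad
        }
    ' "$1"
}

# Usage (file name and adapter prefix taken from the question above):
#   adapter_fraction SRR810103.fastq TCGTATGCC
```

Searching for only the first 8 to 10 bases of the adapter catches reads where the adapter runs off the end of the read.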

ADD REPLY • link 3.0 years ago by ATpoint 81k

0

Thanks! The fastqc result shows no adapter. fastqc didn't work.

ADD REPLY • link 3.0 years ago by Hongyan • 0

0

If it shows no adapter, then there are none and you do not need any trimming. If (unlikely) there is an adapter that fastqc does not recognize, it should still show up as an overrepresented sequence. If there are neither overrepresented sequences nor detected adapters, then I'd say the data are good to be aligned right away without further processing.

ADD REPLY • link 3.0 years ago by ATpoint 81k

0

fastp does have an automatic adapter detection option (I believe this is its default behavior).

ADD REPLY • link 3.0 years ago by Dunois ★ 2.5k

score 2 · Answer 1 · 2021-04-28

One way to probe the adapter content is to slice and group the ends of the reads. I do this often as a quick sanity check. It is a simple way to detect possible systematic contamination that starts at a given coordinate. Get a dataset with 251 bp long reads:

fastq-dump -X 100000 SRR519926 --split-files

For example, to see whether there are 30 bp long common sequences starting at base 210, you could run:

cat SRR519926_1.fastq | bioawk -c fastx '{print substr($seq, 210, 30) }' | sort | uniq -c | sort -rn | head

The command above uses bioawk, but regular awk works as well if you drop the non-sequence lines. The output is:

  47 TCGGAAGAGCACACGTCTGAACTCCAGTCA
  47 GGAAGAGCACACGTCTGAACTCCAGTCACG
  45 ACACGTCTGAACTCCAGTCACGTAACATCA
  41 GATCGGAAGAGCACACGTCTGAACTCCAGT
  37 ACTCCAGTCACGTAACATCATCTCGTATGC
  36 GAGCACACGTCTGAACTCCAGTCACGTAAC
  36 CAGTCACGTAACATCATCTCGTATGCCGTC
  36 CACGTCTGAACTCCAGTCACGTAACATCAT
  36 AGATCGGAAGAGCACACGTCTGAACTCCAG
  36 AGAGCACACGTCTGAACTCCAGTCACGTAA
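For reference, the plain awk route could look like the sketch below. It assumes a four-line-per-record FASTQ file, and the function name `tail_kmers` is made up for illustration. Unlike the bioawk one-liner, it skips reads that are shorter than the start position rather than printing empty lines for them.

```shell
# Print the most common substrings of length LEN starting at 1-based position START.
# Assumption: 4 lines per record, so every line with NR % 4 == 2 is a sequence.
tail_kmers() {
    awk -v start="$2" -v len="$3" \
        'NR % 4 == 2 && length($0) >= start { print substr($0, start, len) }' "$1" |
        sort | uniq -c | sort -rn | head
}

# Usage, mirroring the bioawk command above:
#   tail_kmers SRR519926_1.fastq 210 30
```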

A little eyeballing tells us that most of the sequences overlap and appear to be different substrings of a much longer adapter sequence. For a more "proper" solution, extract and align the ends to see whether these sequences overlap.
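As a rough stand-in for a real alignment, the overlap check can be sketched as a greedy merge: start from the first sequence and keep extending it, left or right, with any sequence that shares an exact overlap of at least some minimum length. This is only an illustrative sketch, not a proper assembler. The function name `merge_overlaps` and the default minimum overlap of 10 are assumptions, mismatches are not tolerated, and repeats can mislead it.

```shell
# Greedily merge overlapping sequences read from stdin into one contig.
# Takes the last field of each line, so "count sequence" lines from uniq -c work too.
# Optional argument: minimum exact overlap (assumed default: 10).
merge_overlaps() {
    awk -v minov="${1:-10}" '
    # Longest suffix of a that equals a prefix of b; 0 if shorter than minov.
    function overlap(a, b,    k, max) {
        max = length(a) < length(b) ? length(a) : length(b)
        for (k = max; k >= minov; k--)
            if (substr(a, length(a) - k + 1) == substr(b, 1, k)) return k
        return 0
    }
    NF { seq[++n] = $NF }
    END {
        if (n == 0) exit
        contig = seq[1]; used[1] = 1
        do {
            changed = 0
            for (i = 2; i <= n; i++) {
                if (i in used) continue
                s = seq[i]
                if (index(contig, s)) { used[i] = 1; changed = 1; continue }
                if ((k = overlap(contig, s)) > 0) {          # extend to the right
                    contig = contig substr(s, k + 1); used[i] = 1; changed = 1
                } else if ((k = overlap(s, contig)) > 0) {   # extend to the left
                    contig = substr(s, 1, length(s) - k) contig; used[i] = 1; changed = 1
                }
            }
        } while (changed)
        print contig
    }'
}

# Small made-up example: three overlapping fragments, minimum overlap of 4
printf 'ACGTTGCA\nTTGCAGGC\nGGACGTTG\n' | merge_overlaps 4
```

If the top substrings from the read ends collapse into one sequence that is much longer than any single substring, that supports the idea that they all come from the same adapter.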