When paired-end reads don't join, can I treat them as single-end in a metagenomic analysis?
6.5 years ago

Hello all,

I'm doing a metagenomic analysis on human samples that were sequenced on the MiSeq platform, paired-end reads, 40 to 300 bp long (it varies according to the FastQC report). After the trimming/quality-control phase, I ended up with reads 60 to 190 bp long, but those paired-end reads can no longer be joined. :(

I thought about 2 solutions:

  1. Treat all reads (R1 and R2) as single-end reads, concatenate them all, and proceed with the metagenomic analysis in QIIME, or;
  2. Use only the forward (R1) or reverse (R2) reads for the downstream analysis in QIIME.

Is there a third, better option? Does either of the solutions I described make sense?
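For option 1, note that R2 reads come from the opposite strand, so they are usually reverse-complemented before being pooled with R1. A minimal sketch of the concatenation step, assuming uncompressed four-line FASTQ files with hypothetical names:

```python
# Concatenate R1 as-is and reverse-complemented R2 into one single-end FASTQ.
# File names are hypothetical placeholders.
COMP = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def revcomp(seq):
    return seq.translate(COMP)[::-1]

def fastq_records(handle):
    """Yield (header, seq, plus, qual) tuples from a four-line FASTQ handle."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        plus = handle.readline().rstrip()
        qual = handle.readline().rstrip()
        yield header, seq, plus, qual

def concat_as_single_end(r1_path, r2_path, out_path):
    with open(out_path, "w") as out:
        with open(r1_path) as r1:
            for header, seq, plus, qual in fastq_records(r1):
                out.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
        with open(r2_path) as r2:
            for header, seq, plus, qual in fastq_records(r2):
                # Reverse-complement the read and reverse its quality string.
                out.write(f"{header}\n{revcomp(seq)}\n{plus}\n{qual[::-1]}\n")
```

Keep in mind that pooling both mates counts each sequenced fragment twice, which is one reason some people prefer option 2 (R1 only).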

Thank you all very much for any help or tips you can provide me.

Fernanda Luz

sequencing metagenomics qiime

You should always merge the reads first (you can adapter-trim beforehand, but NOT quality-trim). After merging the reads you can quality-trim if needed. I recommend trying BBMerge from the BBMap suite for merging, unless you are following an established pipeline like QIIME or Mothur.
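For intuition about what merging does: tools align the 3' end of R1 against the reverse complement of R2 and join the pair at the overlap. A toy sketch of the idea (exact-match only; real mergers such as BBMerge also use quality scores and reject ambiguous overlaps):

```python
COMP = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    return seq.translate(COMP)[::-1]

def naive_merge(r1, r2, min_overlap=10):
    """Merge r1 with revcomp(r2) at the longest exact suffix/prefix overlap.

    Returns the merged sequence, or None if no overlap >= min_overlap exists.
    A real merger scores mismatches and rejects ambiguous overlaps.
    """
    r2rc = revcomp(r2)
    for olen in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
        if r1[-olen:] == r2rc[:olen]:
            return r1 + r2rc[olen:]
    return None
```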


Thank you for your reply; I will give BBMerge a try on the raw reads. I tried to join the reads before trimming them, but I got less than 10% joined reads (I used fastq-join from the QIIME pipeline).


If you expect the reads to join (based on how you designed your amplicons) and they do not, then something may be wrong with the libraries. If your inserts are short and you have a lot of adapter contamination, you are going to get short reads after the adapters are removed: they should overlap, but they will not give you the long reads you expect.


Another problem is that I don't have the information about the library preparation, such as the size of the overlap region between R1 and R2. I used fastq-join in default mode. According to the FastQC report, I don't have adapter contamination, just really bad quality scores at the ends of the reads.


Did you have a chance to try bbmerge.sh? Curious to see whether it improves things.


I'm going to try it today. As soon as I get a result, I will let you know. Thank you :)


Start with the original data for the first bbduk try. Do not worry about the bad quality at the ends of the reads.


I ran bbmerge.sh today on one of my samples and got this report:


BBMerge version 37.58
Writing mergable reads merged.
Started output threads.
Total time: 1.617 seconds.

Pairs:                  54443
Joined:                 3442        6.322%
Ambiguous:              45310       83.225%
No Solution:            5691        10.453%
Too Short:              0           0.000%

Avg Insert:             456.8
Standard Deviation:     21.3
Mode:                   460
Insert range:           63 - 489
90th percentile:        460
75th percentile:        460
50th percentile:        460
25th percentile:        460
10th percentile:        443

Not looking good, right?

About bbduk.sh: I don't have the contaminant files for the 'ref' parameter.


For the adapters (if that is what you meant by contaminants), the sequences are available in a file called adapters.fa in the resources directory of the BBMap software bundle.

You have a high percentage of ambiguous overlaps (Ambiguous: 45310, 83.225%). That could signify simple repeats like ACACACACAAC at the ends of the reads, which make overlapping them difficult.
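To see concretely why simple repeats cause ambiguity, count how many overlap offsets give a perfect match: a low-complexity end matches at several offsets, so no single overlap can be chosen. An illustrative sketch (exact matching only; real tools score mismatches too):

```python
def count_exact_overlaps(r1, r2rc, min_overlap=4):
    """Count overlap lengths at which r1's suffix exactly matches r2rc's prefix.

    More than one candidate means the overlap is ambiguous.
    (r2rc is R2 already reverse-complemented.)
    """
    return sum(
        1
        for olen in range(min_overlap, min(len(r1), len(r2rc)) + 1)
        if r1[-olen:] == r2rc[:olen]
    )
```

With a repetitive AC-rich end, several overlap lengths match equally well; with a unique end, only one does.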

BTW: you don't appear to have many reads. 54443 pairs is a relatively tiny amount of data. Is that correct?


Unfortunately it is correct; my data set is really small.

I ran bbduk and it removed only 6 reads from my data. Then I reran bbmerge and, as expected, got no change in the ~6% of merged reads.

It does look a little like a lost cause, right? Should I treat it as single-end data and continue the analysis with only the R1 reads?


At least we can be fairly sure that there is no adapter contamination in your data.

Can you take a sample of reads (from the R1/R2 files) and BLAST them at NCBI? Confirm that they retrieve hits to the right sequences. It is possible that something went wrong somewhere in the experimental procedure and you don't have the right kind of data.
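A quick way to do that check is to convert the first few FASTQ records to FASTA and paste them into the NCBI web BLAST. A minimal sketch, with hypothetical file names:

```python
def fastq_head_to_fasta(fastq_path, fasta_path, n=20):
    """Write the first n FASTQ records as FASTA, ready to paste into NCBI BLAST."""
    with open(fastq_path) as fq, open(fasta_path, "w") as fa:
        for _ in range(n):
            header = fq.readline().rstrip()
            if not header:
                break  # fewer than n records in the file
            seq = fq.readline().rstrip()
            fq.readline()  # '+' separator line
            fq.readline()  # quality line
            fa.write(f">{header.lstrip('@')}\n{seq}\n")
```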


Hey Fernanda, is there any particular reason why the data is of poor quality (according to FastQC)? Are you running FastQC with default parameters? It seems a waste to throw away such long reads.


Thank you for your reply, Kevin. Yes, I ran FastQC with default parameters.

This is a real example of one of my libraries' FastQC reports (just the per-base sequence quality):

R1 example

R2 example


Wow, the quality is very poor at the 3' end. Yes, now I can understand the trimming to ~190 bp max. I think the quality is poorer than it should be; you should check to ensure that nothing went wrong during the sequencing process.

The other comments above will also help you.

6.1 years ago
h.mon

Another problem is that I don't have the information about the library preparation, such as the size of the overlap region between R1 and R2.

Have the primers been removed already? Which primer pair did you use? What is the size of your amplicon?

Several (all?) workflows demand that primers be removed in order to perform the taxonomic assignment, so you should not proceed without knowing whether the primers have been removed. You should either get these answers from the person who gave you the data, or do some detective work: check whether standard primers are present at the beginning of your reads; you would then also know the expected amplicon size.
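That detective work can be scripted: check what fraction of reads begin with a candidate primer sequence. A simple sketch using exact matching (the primer in the test is a placeholder; real 16S primers often contain degenerate IUPAC bases, which would need extra handling):

```python
def primer_start_fraction(fastq_path, primer):
    """Fraction of reads whose 5' end exactly matches the given primer."""
    total = hits = 0
    with open(fastq_path) as fq:
        while True:
            header = fq.readline()
            if not header:
                break
            seq = fq.readline().rstrip()
            fq.readline()  # '+' separator line
            fq.readline()  # quality line
            total += 1
            if seq.startswith(primer):
                hits += 1
    return hits / total if total else 0.0
```

A fraction near 1.0 for one of the standard primers would tell you the primers have not been removed yet.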

Do you know whether phiX was added to your sequencing run, or whether laddered primers were used? The poor quality of your reads may be due to the low complexity of the samples. In any case, I would seriously consider whether it is worth proceeding with this data, as it seems really bad.

