Illumina paired-end reads R1 and R2 mixed together?
1
3.0 years ago
lvogel • 20
@lvogel9382

Hi, I have received fastq files containing the reads from Illumina MiSeq. Since they are paired-end, there is an R1 and an R2 file for each sample. So I expected to find reads beginning with our forward primer in the R1 files, and reads beginning with our reverse primer in the R2 (or vice versa). However, I find both in both; i.e. about half of the reads in the R1 files begin with the forward primer, and half with the reverse primer; and same with the R2s. I tried merging them, but this results in about half of the reads being reverse complemented, and this makes things more complicated downstream, so I would like them to all go in the same direction. I thought to grep for each of the primers, but because of ambiguities and some still having short tags on the beginning, I don't think it's going to work--plus I thought they weren't supposed to be mixed anyway...??? Maybe I don't understand this as well as I thought. Any ideas? Thanks.

Illumina metabarcoding • 3.9k views
1

Could you elaborate on how the library was prepared?

1

It sounds like your amplicon library was constructed by the standard Illumina method (i.e., adaptor ligation) and sequenced with standard Illumina (adaptor) primers. If so, then you'd expect a 50/50 mix of amplicon orientations. But @WouterDeCoster is correct, we'll need more details about library prep (e.g., what are the short tags to which you refer) to help you parse the data.

2

The primers used in the sequencing step anneal to the adaptors you ligate to your fragmented DNA or cDNA.

The ligation of these adaptors to the DNA fragments is fully random (it does not take direction into account), except when you are using a stranded transcriptomic protocol.

For genomic sequences, though, I am not aware of a protocol that will give you directional libraries.

0

Just plain (multiplex) PCR based enrichment & library prep can be directional.

0

Have you tried scanning the data with a trimming program? I suggest bbduk.sh from the BBMap suite. You may have inserts that are shorter than the read length. While you are at it, you could also use bbmerge.sh from the same suite to see what you get in terms of merging the R1/R2 reads.

3
3.9 years ago
jomo018 • 480
@jomo01823839

Actually the reads are always mixed just the way you describe them. R1 may be forward or reverse. R2 may also be forward or reverse. You are only guaranteed that the two reads of a pair come from opposite strands. Depending on your requirements, you may indeed need to check which is which down the pipeline. Standard alignment utilities do that automatically.
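
For amplicon data with no reference to align against, "checking which is which" could be sketched roughly as below. This is only an illustration in Python, not something from BBMap or any other tool mentioned in this thread. The two primer sequences are hypothetical placeholders (the real ones were not posted), and the allowance of up to 6 extra leading bases is an assumption meant to cover short tags like the ones the question describes.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Hypothetical placeholder primers; substitute the real ones.
FWD_PRIMER = "GGWACWGG"
REV_PRIMER = "TANACYTC"


def primer_regex(primer, max_tag=6):
    """Match the primer at the read start, after at most max_tag extra bases."""
    body = "".join(IUPAC[base] for base in primer.upper())
    return re.compile(rf"^[ACGTN]{{0,{max_tag}}}?{body}")


FWD_RE = primer_regex(FWD_PRIMER)
REV_RE = primer_regex(REV_PRIMER)


def classify(seq):
    """Return 'forward', 'reverse', or 'unknown' for one read sequence."""
    seq = seq.upper()
    if FWD_RE.match(seq):
        return "forward"
    if REV_RE.match(seq):
        return "reverse"
    return "unknown"
```

Exact regex matching like this tolerates ambiguity codes in the primer but not sequencing errors, so some reads will land in "unknown"; a dedicated primer-trimming tool with a mismatch allowance would catch more of them.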

0

jomo018, since it's barcoding, I don't think I'm using any of the standard alignment utilities you're referring to--could you give some examples?

Also, I'll add the following details which were asked for, in case someone finds them useful:

• Nextera Indices Kit was used, with i5 and i7, for multiplexing.
• The primers we use also have their own old barcodes, apparently.
• By "tags"--maybe these aren't tags per se, but there is sometimes TCAT occurring before the forward primer, and GGAG occurring before the reverse primer.

Here is what I got from BBDuk:

Input is being processed as paired

Input: 178704 reads 51661997 bases.
KTrimmed: 23191 reads (12.98%) 605133 bases (1.17%)
Total Removed: 2 reads (0.00%) 605133 bases (1.17%)
Result: 178702 reads (100.00%) 51056864 bases (98.83%)

Here is from BBMerge:

Pairs: 89351
Joined: 23783 26.617%
Ambiguous: 28013 31.352%
No Solution: 37555 42.031%
Too Short: 0 0.000%

Avg Insert: 357.7
Standard Deviation: 12.2
Mode: 365

Insert range: 52 - 425
90th percentile: 365
75th percentile: 365
50th percentile: 365
25th percentile: 339
10th percentile: 339
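
As a quick sanity check on the numbers above, the BBMerge outcome counts can be recomputed from the reported pair count. This is just arithmetic on the figures pasted in this post, written as a small Python snippet; nothing here calls BBMerge itself.

```python
# Counts copied from the BBDuk and BBMerge output above.
reads_after_bbduk = 178702
pairs = 89351
outcomes = {"Joined": 23783, "Ambiguous": 28013, "No Solution": 37555, "Too Short": 0}

# Paired input: two reads per pair.
assert reads_after_bbduk // 2 == pairs

# Every pair falls into exactly one outcome category.
assert sum(outcomes.values()) == pairs

# Percentages of all pairs, rounded the way the report prints them.
percentages = {name: round(100 * n / pairs, 3) for name, n in outcomes.items()}
for name, pct in percentages.items():
    print(f"{name}: {pct:.3f}%")
```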

1

Are you saying that there are two levels of indexes in this experiment: first the Illumina Nextera indexes and then, once samples are demultiplexed into those pools, "inline" barcodes that further split the Nextera pools into individual samples?

Your inserts are of a good size and there are no primer dimers so the data looks good at that level.

0

genomax2, unfortunately I'm not sure. Could it be that the short things I thought were tags (TCAT and GGAG) are "inline" barcodes, since they are so short? Since there is only one forward and one reverse primer, they aren't serving any purpose (no further splitting of the pools). They are sometimes there and sometimes not, which just makes things more complicated for me.

1

Do you perhaps have the primer sequences that were used to amplify the targets?

Did I understand correctly that in a first PCR the targets are selectively amplified using tagged primers, followed by a universal PCR to add barcodes and Illumina adapters?

0

Yes! :) I should have mentioned that too. (I only added a tag that said "metabarcoding") The target is a segment of the CO1 gene. So since I'm not doing genome assembly, the suggestion that standard assembly utilities will specify which of my reads are in the RC direction might not help. Although I've already accepted an answer, I kind of asked two questions in this post. I understand now that the mixture of directions is normal. The question still remains about how to get all my reads to go in the same direction, to make things easier downstream. I'll post it as a new question if necessary.

0

You can use reformat.sh from BBMap to reverse-complement the reads. The two options you are looking for are:

rcomp=f                 (rc) Reverse-compliment reads.
rcompmate=f             (rcm) Reverse-compliment read 2 only.

0

Thanks, I'll try it tomorrow when I can access my data & upvote you if it does what I need.

0

Either I don't understand it correctly, or it doesn't do what I want. I tried like this:

bash reformat.sh in=merged.fq out=mergedr.fq rcomp


and variations of rcomp and rcompmate, but it either just reverse complements all of them or none of them. Apparently, reverse complementing only the reads that are in a different direction than the other reads is not commonly done, based on some of the responses here.

1

"reverse complementing only the reads that are in a different direction than the other reads is not commonly done"

That is correct. You may need to identify which reads map to one strand or the other (forward strand or reverse strand), isolate them (you could do that using filterbyname.sh), and then reverse-complement them using reformat.sh.

"All of my sequences should be approximately the same length"

Curious why that is a requirement.
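
The identify, isolate, and reverse-complement idea above can be sketched in a few lines of Python. This is a rough illustration, not what filterbyname.sh or reformat.sh do internally. The `is_reverse` predicate is a hypothetical stand-in for whatever check decides that a read is in the unwanted orientation (for example, a primer match at the read start), and records are assumed to be simple (name, sequence, quality) tuples.

```python
# Complement table covering IUPAC codes, in upper and lower case.
_BASES = "ACGTRYSWKMBDHVN"
_COMPS = "TGCAYRSWMKVHDBN"
COMPLEMENT = str.maketrans(_BASES + _BASES.lower(), _COMPS + _COMPS.lower())


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def orient(records, is_reverse):
    """Yield (name, seq, qual), flipping records for which is_reverse(seq) is true."""
    for name, seq, qual in records:
        if is_reverse(seq):
            # Qualities are per base, so they are reversed along with the sequence.
            yield name, reverse_complement(seq), qual[::-1]
        else:
            yield name, seq, qual


# Toy example; the predicate is a hypothetical stand-in for a real primer check.
records = [("read1", "ACGTT", "IIIII"), ("read2", "AACGT", "ABCDE")]
oriented = list(orient(records, lambda s: s.startswith("AACG")))
```

Doing the flip per record avoids having to split the file into two name lists first, at the cost of needing an orientation test that works without an alignment.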

0

OK, thanks. Now I see how it would need to be done. And without SAM/BAM files, it might be more work than it's worth.

"Curious why that is a requirement."

We use primers that amplify a 313-bp coding region of a gene, so this region really shouldn't vary in length by more than a few bp. For clustering into OTUs, fragments should first be trimmed to the same ~313-bp region. I suppose this is also why I don't have BAM files, just gzipped FASTQs, since the data isn't as big in barcoding as in genomics.
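
To make the length requirement concrete, here is a minimal Python sketch of a length filter around the 313-bp target. It assumes the sequences have already been merged and had their primers removed. The tolerance of 3 bp is my own guess at what "a few bp" means, not a value from any tool or protocol.

```python
EXPECTED_LEN = 313   # amplicon length stated above
TOLERANCE = 3        # assumed meaning of "a few bp"; adjust as needed


def within_expected_length(seq, expected=EXPECTED_LEN, tol=TOLERANCE):
    """True if the sequence length is within tol bases of the expected length."""
    return abs(len(seq) - expected) <= tol


def split_by_length(seqs):
    """Separate sequences into those near the expected length and the rest."""
    kept = [s for s in seqs if within_expected_length(s)]
    rejected = [s for s in seqs if not within_expected_length(s)]
    return kept, rejected
```

Keeping the rejected sequences in a separate list, rather than dropping them silently, makes it easy to inspect what is being lost before clustering.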

1

Wouldn't you rather first map the data, then slice the data by expected positions to obtain equal read lengths?

0

That's a good point. It's just that I've never thought to map my sequences to anything: since they are environmental samples, many species from multiple orders or even classes will have been amplified, so I don't know what I would use as a reference. From BLASTing, I know that most of the matches to the database have relatively low percent identity. ...But still I'm curious to look into this method now.

0

I don't know why you would need them in the same direction. If you are just doing standard mapping and variant calling there is no issue. Although I don't really know what the downstream analysis is.

0

All of my sequences should be approximately the same length, which sometimes necessitates trimming bases from the left and/or right, and this is easier to do correctly when they are all in the same direction. And useful for graphical visualization. And it makes clustering into OTUs slightly more accurate.

0
3.0 years ago
gb • 780
@gb41746

I use FLASH to merge the reads, really easy to use.

Other options are:

PEAR

vsearch/usearch -fastq_mergepairs

Depending on the sequence length, you can first merge the reads and then trim the primers.

0

Ah, I took a quick look at it, and they didn't compare VSEARCH, which is what I'm using these days. I've recently found an interesting solution to my original question. I wanted to put all the merged reads in the same orientation for the next steps of the pipeline, e.g. dereplication and BLASTing. So I use the fastx_revcomp command of VSEARCH to reverse-complement all of them, then run cutadapt with the --discard-untrimmed option on the combined file of original and reverse-complemented reads to remove the primers. Everything that didn't have the primer, which is mostly the reads in the unwanted direction, gets deleted.
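
In case the logic is useful to anyone, here is a rough Python sketch of the same idea: pool each read with its reverse complement, then keep only the copies in which the forward primer is found near the start. This is not how VSEARCH or cutadapt are implemented. It uses exact matching of a single non-degenerate primer (a hypothetical placeholder), whereas cutadapt handles ambiguity codes and mismatches, and the limit of 6 leading bases is an assumption.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_forward_primer(seq, primer, max_offset=6):
    """Remove everything up to and including the primer.

    Returns None if the primer does not start within the first
    max_offset bases (the 'discard untrimmed' case).
    """
    pos = seq.find(primer, 0, max_offset + len(primer))
    if pos == -1:
        return None
    return seq[pos + len(primer):]


def orient_and_trim(reads, primer):
    """Pool reads with their reverse complements; keep only primer-trimmed copies."""
    reads = list(reads)
    pooled = reads + [revcomp(r) for r in reads]
    trimmed = (trim_forward_primer(r, primer) for r in pooled)
    return [t for t in trimmed if t is not None]
```

Because each amplicon appears in the pool in both orientations, only the copy that begins with the forward primer survives, so the output ends up in a single direction without ever classifying the reads explicitly.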