Question

Efficient way to perform fastq filtering of contaminants by read name prior to transcriptome assembly?

0

Entering edit mode

3.0 years ago

jfaberha ▴ 50

Hi everyone. I have paired-end transcriptome fastq files from 44 samples and I'm hoping to build de novo assemblies for each sample. The problem is there seems to be a fair amount of contamination in many of our samples that represents bacterial, human, and rodent sequence (our species is a teleost fish, btw).

I discovered this contamination by first mapping to a reference genome (incomplete) for DE analysis then building a de novo transcriptome with Trinity from the pooled unmapped reads to find genes from genome gaps. Contaminant sequences were enriched in this assembly to the point where we accidentally built a partial mouse transcriptome.

For a follow up analysis, I want to build de novo transcriptomes for each sample to compare certain gene variants between experimental groups, but I want to remove contamination first so I don't get weird hybrid contigs from orthologous genes across species. At this point I know exactly which contigs from the unmapped read assembly represents contamination and I know the reads from the unmapped read pool that were used to build these contigs. Now I want to remove these reads from the respective samples prior to building sample-specific transcriptome assemblies.

I'm trying to using the filterbyname.sh script from BBmap with the command below, but it's running extremely slowly (4 days and counting for a single sample).

filterbyname.sh in=read1.fastq in2=read2.fastq out=filt_read1.fastq out2=filt_read2.fastq names=non-fish_reads.txt include=f substring=name

The "non-fish_reads.txt" is a list of about 22 million unique fastq headers from across all 44 samples that represent apparent contaminants. I wish I had sample-specific contaminant lists, but they came from the pooled assembly and I'm assuming finding which sample they belong to may take almost as much time as simultaneously filtering them.

Can anyone recommend possible faster workarounds or alternative strategies to BBmap filtering? I was thinking of aligning raw fastq files to the contaminant contigs at low stringency and exporting unmapped reads for de novo assembly, but I worry that this method might still miss contaminant reads with overhangs, base miscalls, etc. It would also work if I could feed this read list directly to an assembler as the reads to ignore, but I can't find that option for Trinity.

Thanks in advance.

trinity BBmap transcriptome filtering assembly • 2.3k views

ADD COMMENT • link updated 3.0 years ago by GenoMax 141k • written 3.0 years ago by jfaberha ▴ 50

0

Entering edit mode

If you have exact names do not use substring. Can you show us example of what non-fish_reads.txt looks like? To be fair this is a large job but it should not take 4 days for 1 sample.

ADD REPLY • link 3.0 years ago by GenoMax 141k

0

Entering edit mode

I've been playing around with it today and I restarted after changing "substring=name" to "prefix=t", but I'm not sure if that is making it sufficiently faster yet. The contents of "non-fish_reads.txt" are formatted like this:

K00124:567:HF5W5BBXX:1:1101:10003:13869

which was reformatted from Trinity files where they were listed like this:

>K00124:567:HF5W5BBXX:1:1101:10003:13869/1

I removed the ">" character and the mate-pair info for two reasons. First, I would like to filter both members of the pair and second, because the format from the Trinity output truncated a portion of the original fastq header and I wasn't sure it would be recognized if the entire line didn't match. The header in the original fastq is formatted like this:

@K00124:567:HF5W5BBXX:1:1101:10003:13869 1:N:0:NCGCTCGA

I was looking through the documentation but couldn't find an answer. Does the "filterbyname.sh" script work on entire lines or on the first field before whitespace?

ADD REPLY • link 3.0 years ago by jfaberha ▴ 50

0

Entering edit mode

If headers in your files look like @K00124:567:HF5W5BBXX:1:1101:10003:13869 1:N:0:NCGCTCGA but if your names file just has K00124:567:HF5W5BBXX:1:1101:10003:13869 then you will have to include substring=tto filter the reads.

I filtered ~300K reads and kept 6 reads from my names file (include=t) in less than a second so it should not take 4 days to run one sample.

Time:                           0.687 seconds.
Reads Processed:        299k    436.56k reads/sec
Bases Processed:      89057k    129.66m bases/sec
Reads Out:          6
Bases Out:          1782

Removing those six reads and keeping rest (include=f) required

Time:                           0.793 seconds.
Reads Processed:        299k    378.12k reads/sec
Bases Processed:      89057k    112.30m bases/sec
Reads Out:          299850
Bases Out:          89055450

ADD REPLY • link 3.0 years ago by GenoMax 141k

0

Entering edit mode

Thank you for the input. Those numbers are closer to what I was hoping for in terms of processing time and they make me think I have the BBmap configured wrong. I'm going to attempt to troubleshoot/reinstall to see if I can't up it's efficiency.

ADD REPLY • link 3.0 years ago by jfaberha ▴ 50

1

Entering edit mode

Simply use names=namefile substring=t as only options. Leave everything else at default. Also assign memory explicitly with -XmxNNg. The NN number should be about 75% of memory you have available at max.

ADD REPLY • link 3.0 years ago by GenoMax 141k