Remove duplicates reads Ids
4
2
Entering edit mode
7.0 years ago
BioGeek ▴ 170

I mapped reads with

bwa mem -M -t 40 allCombinedFinalSet.fa Seq.R1.fastq Seq.R2.fastq > aln.sam

Extracted the mapped reads

samtools view -f 0x2 -b aln.bam > output.bam

Extracted the fastq

bamToFastq -i output.bam -fq R1.fq -fq2 R2.fq 

grep @HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1 R1.fq             []
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1

I notice it has duplicated ....

I think this because read was mapped twice (i.e. BWAmem).

I tried fastuniq but it does not remove the duplicated reads.

Can you please help me to remove duplicated reads from fastq files.

reads ids duplicates mapping fastq • 5.6k views
ADD COMMENT
0
Entering edit mode

What exactly are you trying to do?

ADD REPLY
0
Entering edit mode

I am trying to mapped the allReads against "filtered" genome and extract mapped fastq for re-assembly.

ADD REPLY
1
Entering edit mode
7.0 years ago

Those duplicates are secondary/supplementary alignments. You may try Picard SamToFastq for a better control over what to convert in Fastq from BAM. See the discussion here: http://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html

ADD COMMENT
0
Entering edit mode
7.0 years ago
k.balaji93 ▴ 10

Read Samtools man for flag descriptions. You want to filter out secondary and chimeric alignments, using samtools in the bam file. Check out Picard explain flag to get the exact flag to keep only primary alignments.

ADD COMMENT
0
Entering edit mode
7.0 years ago

(assuming that the reads in R1.fq are in the same order + same number than R2.fq)

paste <(cat R1.fq | paste - - - - )  <(cat R2.fq | paste - - - - ) | LC_ALL=C sort -t $'\t' -k1,1  | tr "\t" "\n" > interleaved.fq
ADD COMMENT
0
Entering edit mode
bio@bio214b[biogeek] paste <(cat see.R1.fq | paste - - - - )  <(cat see.R2.fq | paste - - - - ) | LC_ALL=C sort -t $'\t' -k1,1  | tr "\t" "\n" > interleaved.fq
bio@bio214b[biogeek] grep HJ2KCBCXX:1:1104:14672:39678/1 interleaved.fq
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1
bio@bio214b[biogeek] grep HJ2KCBCXX:1:1104:14672:39678/2 interleaved.fq
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/2
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/2
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/2

The duplicated still exists.

ADD REPLY
0
Entering edit mode

If all you need to do is remove duplicates then use dedupe.sh from BBMap suite. dedupe.sh in=orig.fq out=dedupe.fq Should work with in1= in2= out1= out2= for PE files.

ADD REPLY
0
Entering edit mode
$ dedupe.sh in1=see.R1.fq in2=see.R2.fq out1=tt.R1.fq out2=tt.R2.fq          []
java -Djava.library.path=/home/Tools/bbmap/jni/ -ea -Xmx200291m -Xms200291m -cp /home/Tools/bbmap/current/ jgi.Dedupe in1=see.R1.fq in2=see.R2.fq out1=tt.R1.fq out2=tt.R2.fq
Executing jgi.Dedupe [in1=see.R1.fq, in2=see.R2.fq, out1=tt.R1.fq, out2=tt.R2.fq]

Exception in thread "main" java.lang.RuntimeException: Unknown parameter out1=tt.R1.fq
    at jgi.Dedupe.<init>(Dedupe.java:392)
    at jgi.Dedupe.main(Dedupe.java:82)

It gave an error !!!

ADD REPLY
0
Entering edit mode

While dedupe.sh will work to remove duplicate read ID's I realized that it will have an additional unintended consequence. It will actually de-duplicate your input dataset based on sequence. If you are able to accept that caveat then read on.

dedupe.sh requires input to be interleaved in case of paired-end reads so you would be able to use it like this:

  1. Interleave reads
  2. de-dupe reads
  3. Split the deduped reads into separate R1/R2 files

reformat.sh in1=R1.fq in2=R2.fq out=stdout.fq | dedupe.sh in=stdin.fq out=stdout.fq | reformat.sh in=stdin.fq out1=newR1.fq out2=newR2.fq

I am going to tag Brian Bushnell to see if he has any other suggestion.

My apologies for not thinking this through completely.

ADD REPLY
0
Entering edit mode

I don't really have a tool for this purpose, once you already have a bam file. Instead, I'd recommend using BBMap, like this:

bbmap.sh t=40 ref=allCombinedFinalSet.fa in1=Seq.R1.fastq in2=Seq.R2.fastq outm=mapped.fq outu=unmapped.fq

The mapped reads will go to mapped.fq, and the unmapped reads will go to unmapped.fq. Both files will be interleaved, and pairs will be kept together, so if only one read in a pair maps to the reference, both reads will still go to mapped.fq.

ADD REPLY
0
Entering edit mode
7.0 years ago
BioGeek ▴ 170
    #!/usr/bin/perl-w
use strict;
use warnings;

my %id2seq=();
my $key = '';
$/ = '@HISEQ';
my $add='@HISEQ';
while(<>){
    chomp;
    my ($defLine, @seqLines) = split /\n/, $_;
    my $sequence = join("\n",@seqLines);
        $id2seq{$defLine} = $sequence;
}

foreach(keys %id2seq){
    my $name="$add"."$_";  
    print join("\n",$name,$id2seq{$_}),"\n";
}

I tried my above mentioned perl script to remove, and in first look it seems to be working.

bio@bio214b[biogeek] grep HJ2KCBCXX:1:1104:14672:39678/2 test.R2.fq 
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/2
bio@bio214b[biogeek] grep HJ2KCBCXX:1:1104:14672:39678/1 test.R1.fq
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1

Is that safe to use? It seems distorting the format.

ADD COMMENT
0
Entering edit mode

It seems distorting the format

What exactly is it doing?

ADD REPLY

Login before adding your answer.

Traffic: 2510 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6