Remove/Delete unique reads from a DNA fasta file
17 months ago by saanasum • 10

I have millions of fasta files, each containing many DNA sequences.

I want to batch remove/delete reads (DNA sequences) that are unique within each fasta file. In other words, I only want to keep duplicate reads, i.e. reads that occur at least two times within a given fasta file.

Does anybody know a command-line-based solution or a program/script that does this?

Many thanks!

seqkit common is what you are looking for.
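For example, something like this (file names are placeholders; -s compares records by sequence rather than by name):

seqkit common -s file1.fasta file2.fasta > common.fasta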

I guess this is an excellent solution for finding common sequences between two or more files (I tried it). However, I just want to extract sequences that occur at least two times within ONE file containing many fasta sequences (OR even better: to remove unique sequences in ONE fasta file). Finally, this should be done for millions of fasta files.

However, seqkit might still be the right choice, using the rmdup command with the -d parameter: seqkit rmdup
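For example, something like this (file names are placeholders; -s compares records by sequence, the deduplicated records go to stdout, and -d saves the duplicated sequences to a separate file):

seqkit rmdup -s -d duplicated-reads.fasta input.fasta > deduplicated.fasta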

  • Create a for loop in bash over your files
  • In this loop, call a Python script with one of your files as an argument
  • In this Python script, create a dictionary that you fill with each sequence as a key
  • For each sequence, if it is already a key in your dictionary, output the sequence (which is a duplicate); a sketch follows below
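A minimal, untested sketch of such a script in Python (script and file names are placeholders; it makes two passes over the file so that every copy of a duplicated read is kept, rather than only the second and later occurrences):

#!/usr/bin/env python3
"""Keep only reads whose sequence occurs at least twice in a FASTA file.

Usage: python keep_duplicates.py input.fasta > duplicated_reads.fasta
"""
import sys
from collections import Counter

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def main(path):
    # First pass: count how often each sequence occurs (sequence as dict key).
    counts = Counter(seq for _, seq in read_fasta(path))
    # Second pass: print only the records whose sequence is duplicated.
    for header, seq in read_fasta(path):
        if counts[seq] >= 2:
            print(header)
            print(seq)

if __name__ == "__main__":
    main(sys.argv[1])
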
17 months ago by saanasum • 10

Thanks @finswimmer for the suggestion of using seqkit.

It is possible as described here: seqkit rmdup. With the -d parameter, a file in which to save the duplicated reads can be specified. Using a for loop in bash should enable automation for millions of fasta files.

Isn't this the opposite of what you asked?

You've asked to keep only reads that are duplicated at least once. This will remove all the duplicates instead... perhaps I'm missing something...

-d, --dup-seqs-file string   file to save duplicated seqs.

The way I read this, the duplicated reads are removed and captured in the new file specified.

Ah yep, I see. Knew it was too late in the evening...

Exactly, using -d, a new file containing only the duplicate reads will be generated; in this file, all unique reads are deleted. I can proceed working with this new file. As a batch in bash:

# only the -d file is needed, so rmdup's deduplicated stdout is discarded
for i in *.fasta; do seqkit rmdup -s -i -m -d "${i}_unique-reads-removed" "$i" > /dev/null; done

3 months ago by genomax 68k

dedupe.sh from the BBMap suite should also work. outd= will collect the duplicated sequences.
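A possible invocation (an untested sketch; file names are placeholders, and ac=f is assumed here to disable containment absorption so that only exact duplicates are caught; check the dedupe.sh help message for the exact options):

dedupe.sh in=input.fasta out=deduplicated.fasta outd=duplicated-reads.fasta ac=f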
