Hi guys, I am trying to use seqkit rmdup to remove duplicated sequences from my protein fasta files. However, it's only the accession numbers which are duplicated and not the description or sequences. See example below.
Host_331002_c0_seq1 95 1381 2 +
Host_331002_c0_seq1 1873 2112 1 +
So basically I want to set a flag which will stop at the first tab when searching the identifiers otherwise I won't get any duplicates in my output file. I think this flag would fix it but I am not sure what to enter as regex
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
I just started learning all the programming languages and I am not certain how to change the default so it will stop after " Host_331002_c0_seq1" . Thanks is advance for your help!
Cheers!
seems to be working fine for me. Please post specific example. input:
output:
Oh I see. Well, the sequences are different in my case. See example below.
So it is really just the accession number which was used twice in the database I am working with. Since both the whole identifier (accession + description) and the sequences are different seqkit does not identify it as duplicates. My goal is to pull out the sequences with duplicated accessions and rename them manually. My database contains more then 1.5 million sequences so doing it manually is not really an option. Thanks again for your help!
input:
output
on the other hand, if id is always followed by a space, you can use awk to print id and sequence and then de-duplicate the file. But keep a back up of your data first.
Interesting! Repeating your commands using a test file worked for me as well so it seems to be an issue with the format of my input database. Thank you!