Question

fasta extract using seqkit grep regex pattern

0

Entering edit mode

18 months ago

mthm ▴ 50

I want to extract seuqneces that partially match the name list. my fasta file:

>4_chromosome_RagTag|c(3245459..3246846)|DNA|DNA-1_Dmon
CTACTTAAATCACTTGATGGCGATCGAATTCAATATAGGTCTATCGATTCTGTTACT
>4_chromosome_RagTag|(3012739..3012917)|DNA|DNA-2_Dmon
CATATATCAAATATTCACTGCAGGGTCTTTGCAGACAATTTTGAAAAAATGCTCACAA
>4_chromosome_RagTag|c(1350081..1350297)|DNA|DNA-2_Dmon
AAAATAAGTATTATTAACAAATCAATTTTTATATTTATTTTTATTATTTTAAATTTA

my namelist:

12471413
3245459
1350081

I am trying this code:

seqkit grep --pattern  "^[^]/|/(\D )(\w+) " --pattern-file coordinates file.fasta

so I am trying to tell it to ignore the first part and find the match numbers in the second part, but it doesn't work, how should I fix it?

seqkit fasta extract pattern regex • 2.2k views

ADD COMMENT • link updated 18 months ago by shenwei356 8.4k • written 18 months ago by mthm ▴ 50

2

Entering edit mode

Using filterbyname.sh from BBMap suite:

$ filterbyname.sh -Xmx2g in=test.fa names=1350081 out=stdout.fa substring=t include=t

Input is being processed as unpaired
>4_chromosome_RagTag|c(1350081..1350297)|DNA|DNA-2_Dmon
AAAATAAGTATTATTAACAAATCAATTTTTATATTTATTTTTATTATTTTAAATTTA
Time:                           0.092 seconds.
Reads Processed:           3    0.03k reads/sec
Bases Processed:         172    0.00m bases/sec
Reads Out:          1
Bases Out:          57

ADD REPLY • link 18 months ago by GenoMax 141k

1

Entering edit mode

18 months ago

Matthias Zepper 4.5k

If you want seqkit to interpret your patterns as a regex, you have to provide the -r / --use-regexp flag. However, this means that your patterns are directly interpreted as regex and not that you can specify a template regex that is used to expand the contents of your namelist accordingly.

Therefore, you either have to put the respective Regexes directly in your namelist file or you split the process up into two steps:

grep -f namelist yoursequences.fasta | sed 's/^>//g' > idlist

seqkit grep --pattern-file idlist yoursequences.fasta >  yourfilteredsequences.fasta

Edit: As I just learned from @shenwei356, the --id-regexp flag exists, so there is indeed the option to specify a regex pattern and match in --pattern-file on the evaluated regex.

ADD COMMENT • link 18 months ago by Matthias Zepper 4.5k

score 3 · Accepted Answer · 2022-10-21

3

Entering edit mode

18 months ago

shenwei356 8.4k

Are the patterns starting coordinates? If yes:

seqkit grep --id-regexp '(\d+)\.\.' --pattern-file coordinates file.fasta -o result.fasta

If no:

seqkit grep --id-regexp '(\d+)\.\.' --pattern-file coordinates file.fasta > part1.fasta

seqkit grep --id-regexp '\.\.(\d+)' --pattern-file coordinates file.fasta > part2.fasta

cat part1.fasta part2.fasta > result.fasta

ADD COMMENT • link 18 months ago by shenwei356 8.4k

1

Entering edit mode

This is awesome! I didn't know that the --id-regexp flag exists. Thanks a lot, this will be very handy for me, too!

Too bad that Go doesn't support lookaheads, lookbehinds and other PCRE goodies like \K. Otherwise, [\(\.]\K\d+ could have been used to match start as well as end coordinates with just one call.

ADD REPLY • link 18 months ago by Matthias Zepper 4.5k