fasta extract using seqkit grep regex pattern
2
0
Entering edit mode
18 months ago
mthm ▴ 50

I want to extract seuqneces that partially match the name list. my fasta file:

>4_chromosome_RagTag|c(3245459..3246846)|DNA|DNA-1_Dmon
CTACTTAAATCACTTGATGGCGATCGAATTCAATATAGGTCTATCGATTCTGTTACT
>4_chromosome_RagTag|(3012739..3012917)|DNA|DNA-2_Dmon
CATATATCAAATATTCACTGCAGGGTCTTTGCAGACAATTTTGAAAAAATGCTCACAA
>4_chromosome_RagTag|c(1350081..1350297)|DNA|DNA-2_Dmon
AAAATAAGTATTATTAACAAATCAATTTTTATATTTATTTTTATTATTTTAAATTTA

my namelist:

12471413
3245459
1350081

I am trying this code:

seqkit grep --pattern  "^[^]/|/(\D )(\w+) " --pattern-file coordinates file.fasta

so I am trying to tell it to ignore the first part and find the match numbers in the second part, but it doesn't work, how should I fix it?

seqkit fasta extract pattern regex • 2.2k views
ADD COMMENT
2
Entering edit mode

Using filterbyname.sh from BBMap suite:

$ filterbyname.sh -Xmx2g in=test.fa names=1350081 out=stdout.fa substring=t include=t

Input is being processed as unpaired
>4_chromosome_RagTag|c(1350081..1350297)|DNA|DNA-2_Dmon
AAAATAAGTATTATTAACAAATCAATTTTTATATTTATTTTTATTATTTTAAATTTA
Time:                           0.092 seconds.
Reads Processed:           3    0.03k reads/sec
Bases Processed:         172    0.00m bases/sec
Reads Out:          1
Bases Out:          57
ADD REPLY
3
Entering edit mode
18 months ago

Are the patterns starting coordinates? If yes:

seqkit grep --id-regexp '(\d+)\.\.' --pattern-file coordinates file.fasta -o result.fasta

If no:

seqkit grep --id-regexp '(\d+)\.\.' --pattern-file coordinates file.fasta > part1.fasta

seqkit grep --id-regexp '\.\.(\d+)' --pattern-file coordinates file.fasta > part2.fasta

cat part1.fasta part2.fasta > result.fasta
ADD COMMENT
1
Entering edit mode

This is awesome! I didn't know that the --id-regexp flag exists. Thanks a lot, this will be very handy for me, too!

Too bad that Go doesn't support lookaheads, lookbehinds and other PCRE goodies like \K. Otherwise, [\(\.]\K\d+ could have been used to match start as well as end coordinates with just one call.

ADD REPLY
1
Entering edit mode
18 months ago

If you want seqkit to interpret your patterns as a regex, you have to provide the -r / --use-regexp flag. However, this means that your patterns are directly interpreted as regex and not that you can specify a template regex that is used to expand the contents of your namelist accordingly.

Therefore, you either have to put the respective Regexes directly in your namelist file or you split the process up into two steps:

grep -f namelist yoursequences.fasta | sed 's/^>//g' > idlist

seqkit grep --pattern-file idlist yoursequences.fasta >  yourfilteredsequences.fasta

Edit: As I just learned from @shenwei356, the --id-regexp flag exists, so there is indeed the option to specify a regex pattern and match in --pattern-file on the evaluated regex.

ADD COMMENT

Login before adding your answer.

Traffic: 1729 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6