Grep the complete sequences containing a specific motif in a fasta file
6.6 years ago
taojincs ▴ 50

How can I grep the complete sequences that contain a specific motif in a fasta file? I also want to include the header line beginning with ">" that precedes each of these target sequences.

An example was meant to be shown in an image, but the image is not displayed, so here is a link to the example instead (typing ">" in a Biostars post is kind of misleading): https://drive.google.com/file/d/0B1pci7ps8bLganZXWFNFcWZGd1k/view?usp=sharing

sequence grep fasta linux • 2.9k views

Test file:

$ cat test.fa 
>name1
AEDIA
>name2
ALKME
>name3
AAIII
>name4
kmetq

To extract all sequences with KME in them (the example command also ignores case):

 $ seqkit grep -s -i -r -p KME test.fa 

>name2
ALKME
>name4
kmetq

seqkit needs to be downloaded and installed first. Options used: -s = match against the sequence instead of the name; -r = treat the pattern as a regular expression; -i = ignore case; -p = the search pattern.

If the fasta sequences are linearized (i.e. each sequence is on a single line), then plain grep is enough. Note that --no-group-separator is a GNU grep option:

$ grep -i -B 1 --no-group-separator kme test.fa 
>name2
ALKME
>name4
kmetq
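
If seqkit is not available and the file is not linearized, the same idea can be sketched in plain awk. This is only a minimal sketch: `fasta_grep` is a made-up helper name, and the motif is treated as a literal string (not a regular expression). It joins the wrapped lines of each record before testing, so a motif split across a line break is still found.

```shell
# Hypothetical helper (not part of any tool): print every FASTA record whose
# sequence contains the motif, ignoring case, even if the sequence is wrapped.
# Usage: fasta_grep KME test.fa
fasta_grep() {
    awk -v motif="$1" '
        # Print the buffered record if its joined sequence contains the motif.
        function emit() {
            if (header != "" && index(toupper(seq), toupper(motif)) > 0) {
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }  # a new record starts
        { seq = seq $0 }                               # join wrapped lines
        END { emit() }                                 # do not forget the last record
    ' "$2"
}
```

Matching sequences are printed on a single line, whatever the wrapping in the input was.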
6.6 years ago

First, you'd have to change your file so that each sequence is on a single line. Without this step you'd miss motif hits that are split across a line break.

From Pierre Lindenbaum: A: Multiline Fasta To Single Line Fasta

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa > one_line.fa
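
To see what the linearization does, here is a small variant wrapped in a function. `linearize_fasta` is a made-up name, and this is my own sketch, not the code from the linked answer. The only difference is that it prints the separating newline from the second header onward, so the output does not start with an empty line.

```shell
# Hypothetical helper: put each sequence of a FASTA file on a single line.
# Usage: linearize_fasta file.fa > one_line.fa
linearize_fasta() {
    awk '
        # Header line: close the previous sequence (if any), then print the header.
        /^>/ { if (seen) printf("\n"); seen = 1; print; next }
        # Sequence line: print it without a trailing newline so lines get joined.
        { printf("%s", $0) }
        # Terminate the last sequence.
        END { if (seen) printf("\n") }
    ' "$1"
}
```

For example, a record ">name1" with the sequence wrapped as "AL" and "KME" comes out as ">name1" followed by "ALKME".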

Then you can use grep -B 1 to get each hit together with its preceding line. Let's also set LC_ALL=C, which makes grep compare plain bytes instead of locale-aware characters and usually speeds things up:

LC_ALL=C grep -B 1 KME one_line.fa

That should print the name and the sequence of every record where 'KME' is present. Note that this search is case-sensitive.
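
One thing to watch: with -B, grep prints a "--" line between groups of hits that are not adjacent. A minimal sketch to drop those lines without relying on the GNU-only --no-group-separator option could look like this. `grep_hits` is a made-up name, and I am assuming the file is already linearized and that the motif does not occur in the header lines.

```shell
# Hypothetical helper: case-insensitive motif search on a linearized FASTA
# file, printing each hit with its header and dropping the "--" separators.
# Usage: grep_hits KME one_line.fa
grep_hits() {
    # -B 1 keeps the header line before each hit; -e protects odd patterns.
    LC_ALL=C grep -i -B 1 -e "$1" "$2" |
        grep -v '^--$'   # remove the group separator lines
}
```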
