How to print 2 lines if first has "*" in it
7.4 years ago

I have a long file (fasta) like this:

>Eha-Novel-38_5p 
ACCCATTTTCGTCTGAGGATAAT
>Eha-Novel-38_3p* 
TTTCCCAGACCCAAATGGGTGC
>Eha-Novel-46_3p 
AATGGCGGCCTGATATCCCGGA
>Eha-Novel-46_5p* 
TGGGGTATTAAGCCGCGATTGT
>Eha-Novel-44_3p 
TATCACAGTCATTTACGGGTAC
>Eha-Novel-44_5p* 
TCCCGTATTTGACTGTGACTGAG

I want to print only lines without the "*" and its following line.

Desired output:

>Eha-Novel-38_5p 
ACCCATTTTCGTCTGAGGATAAT
>Eha-Novel-46_3p 
AATGGCGGCCTGATATCCCGGA
>Eha-Novel-44_3p 
TATCACAGTCATTTACGGGTAC

I tried using grep "*" -v -A 1 FILE, but that did not work.

Thanks for your help.

fasta grep • 2.5k views

Maybe you can try:

cat FILE | grep "\*" >OUTPUT

This way you wouldn't have the sequence, just the identifier.


You are right. Maybe sed works:

sed -n -e '/p$/ {p;n;p}' FILE >OUTPUT
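An alternative sed approach is to delete each starred header together with the line after it, rather than printing the unstarred ones. The following is only a sketch: the sample records and the variable names (tmp, sed_kept) are made up for illustration, and it assumes strictly two-line records with the "*" as the last character of the header.

```shell
# Hypothetical sample with one starred record, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' '>a_5p' 'ACGT' '>a_3p*' 'TTGA' '>b_3p' 'GGCC' > "$tmp"

# On a header ending in "*": append the next line (N), then delete both (d).
# All other lines fall through and are printed as usual.
sed_kept=$(sed '/\*$/{N;d;}' "$tmp")

printf '%s\n' "$sed_kept"
rm -f "$tmp"
```

Because nothing is printed with context, no "--" separators appear in the output.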


Everyone needs to learn the before and after flags for grep! Just do -A 1 on the grep:

grep -A 1 "\*" file.fasta >OUTPUT.fasta

-A for lines After
-B for lines Before
-C for lines before and after

I tried using grep "*" -v -A 1 FILE, but that did not work.

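The OP knew about -A; it just doesn't work well with -v. With -v, every sequence line counts as a match (it has no "*"), so -A 1 prints the line after it too, which is often a starred header.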
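To see the problem concretely, here is a minimal sketch on a made-up two-record file (the sample headers and the variable names are assumptions for illustration). The inverted search with context simply returns the whole file.

```shell
# Hypothetical two-record sample written to a temporary file.
tmp=$(mktemp)
printf '%s\n' '>seq1_5p' 'ACGT' '>seq1_3p*' 'TTGA' > "$tmp"

# Every line without "*" is selected by -v, including both sequence lines.
# -A 1 then adds the line after each selected line, so the starred header
# comes back as trailing context of the sequence line before it.
inverted=$(grep -v -A 1 '\*' "$tmp")
original=$(cat "$tmp")

printf '%s\n' "$inverted"
rm -f "$tmp"
```

Comparing $inverted with $original shows that nothing was filtered out.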


There's no need to do an inverse search; just search for lines ending in 'p'.

7.4 years ago
Daniel ★ 4.0k

I think everyone's overthinking this... to select only the non '*' ending headers (i.e. the lines ending in p) just do:

grep -A 1 ">.*p$" file.fasta >output.fasta

Explanation: 
- line starts as a fasta (>)
- has any amount of characters (.*)
- ends with a p (p$)
Then also print the line after it (-A 1).

Edit: To be honest, there's no need for the complicated regex, as we know there's not going to be a 'p' in any sequence lines. So this would work too:

grep -A 1 "p$" file.fasta >output.fasta

Better to add --no-group-separator (GNU grep); otherwise grep prints a "--" line between each group of matches.
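A small sketch of what the separator looks like and two ways to get rid of it. The sample records and variable names are hypothetical. --no-group-separator is a GNU grep option, so the second pipeline is included as a fallback for greps that lack it.

```shell
# Hypothetical four-record sample written to a temporary file.
tmp=$(mktemp)
printf '%s\n' '>a_5p' 'ACGT' '>a_3p*' 'TTGA' '>b_3p' 'GGCC' '>b_5p*' 'CCAA' > "$tmp"

# Plain context search: grep puts "--" between non-adjacent groups.
with_sep=$(grep -A 1 'p$' "$tmp")

# GNU grep only: suppress the separator directly.
without_sep=$(grep --no-group-separator -A 1 'p$' "$tmp")

# Fallback: drop lines that consist of exactly "--".
filtered=$(grep -A 1 'p$' "$tmp" | grep -v -x -e '--')

printf '%s\n' "$without_sep"
rm -f "$tmp"
```

The fallback would also remove a genuine line that is exactly "--", which should not occur in a FASTA file.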

7.4 years ago

I tried to use shell grep, but it's hard to do this.

Try the grep subcommand of SeqKit. Just download the executable binary and run:

./seqkit grep -r -p "\*" -v FILE
>Eha-Novel-38_5p 
ACCCATTTTCGTCTGAGGATAAT
>Eha-Novel-46_3p 
AATGGCGGCCTGATATCCCGGA
>Eha-Novel-44_3p 
TATCACAGTCATTTACGGGTAC

Long-option version:

./seqkit grep --use-regexp --pattern "\*" --invert-match FILE
7.4 years ago

Hi,

It looks like grep's inverted search (-v) and context (-A) do not work well together. I found one-liner solutions with sed and awk, but here is a less elegant but probably simpler solution using only grep:

grep ">" FILE | grep -v "*" | grep -f - FILE -A 1

This first looks for headers in FILE, then selects those without "*" and uses them as patterns to search the file again. If speed is not an issue, I guess this is an OK solution.

PS: If you want to remove the "--" separator lines from the output, you can do it with one more grep (or awk, sed, or whatever you like).

grep ">" FILE | grep -v "*" | grep -f - FILE -A 1 | grep -v -e '^--$'

It works well if the sequences are single-line.

7.4 years ago
5heikki 11k
awk '{if(/^>/ && ! /\*$/){getline var; print $0"\n"var}}' FILE

If line starts with ">" and doesn't end in "*", get the next line into var, print the current line, linebreak, and var.

Also, OP please be more meticulous, the title, what you want, and desired output all differ. The above produces desired output (as long as there are no linebreaks in sequences).
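If the sequences may wrap over several lines, one possible variation is to keep a flag that is updated on every header and print lines while the flag is set. This is only a sketch under that assumption: the sample data and the names keep, tmp, and kept are made up for illustration.

```shell
# Hypothetical sample in which sequences span two lines each.
tmp=$(mktemp)
printf '%s\n' '>a_5p' 'ACGT' 'ACGT' '>a_3p*' 'TTGA' 'TTGA' '>b_3p' 'GGCC' > "$tmp"

# On each header, set keep to 1 if the header does not end in "*", else 0.
# The bare pattern "keep" then prints every line while the flag is true.
kept=$(awk '/^>/ { keep = ($0 !~ /\*$/) } keep' "$tmp")

printf '%s\n' "$kept"
rm -f "$tmp"
```

This avoids getline entirely, so the number of sequence lines per record no longer matters.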


This solution is simple, efficient, readable,... Love awk's magic.


I modified it a little bit while you commented. I think it's more clear like this since print $0 doesn't get repeated. If somebody cares, it was like this before:

awk '{if(/^>/ && ! /\*$/){print $0; getline; print $0}}' FILE

If line starts with ">" and doesn't end in "*", print the line, get the next line, print it.

edit. Maybe this is more clear still:

awk '{if(/^>/ && ! /\*$/){print $0; print $(getline)}}' FILE

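If line starts with ">" and doesn't end in "*", print the line, then print the next line via $(getline). Note that getline returns 1 on success, so $(getline) is really $1 of the newly read line; that equals the whole line only because sequence lines contain no whitespace. I don't know if there's any real difference in speed, but I imagine this one is the fastest.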
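A quick way to check what $(getline) actually yields is to feed awk a second line that contains a space. The input and variable names here are invented for the demonstration.

```shell
# Hypothetical two-line input; the second line deliberately contains a space.
input=$(printf '%s\n' 'header' 'ACGT extra')

# getline reads the next line and returns 1, so $(getline) is field 1 of it.
via_dollar=$(printf '%s\n' "$input" | awk '{ print $(getline) }')

# A plain getline followed by print $0 prints the whole next line.
via_record=$(printf '%s\n' "$input" | awk '{ getline; print $0 }')

printf '%s\n' "$via_dollar" "$via_record"
rm -f /dev/null 2>/dev/null || true
```

For ordinary nucleotide lines the two forms give the same output, but the explicit getline followed by print $0 does not depend on that.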

7.4 years ago
michael.ante ★ 3.8k

Hi Bastianfromm,

As zjhzwang mentioned, it is safest to escape the "*" so that grep always treats it as a literal asterisk rather than as a regex quantifier. Afterwards, you need to remove the separators that grep inserts:

grep -A 1 "\*" in.fa | sed '/^--$/d' > out.fa

Cheers, Michael


I think you can avoid the separators using the --no-group-separator flag (see man page).


Cheers, but this way I get:

>Eha-Novel-44_5p*
TCCCGTATTTGACTGTGACTGAG
>Eha-Novel-46_5p*
TGGGGTATTAAGCCGCGATTGT

whereas I would like the lines WITHOUT "*" and the one line following each.


I guess the following would do the trick:

grep -v -A 1 --no-group-separator "\*" in.fa > out.fa

It failed. The combination of -A and -v can't work correctly here.


Hmm, that makes sense. Linearizing each record onto one line, filtering, and converting back to two lines would probably solve this.
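The linearizing idea could look roughly like this. It is a sketch that assumes strictly two-line records and no tab characters in the headers; the sample data and the names tmp and filtered are hypothetical.

```shell
# Hypothetical three-record sample written to a temporary file.
tmp=$(mktemp)
printf '%s\n' '>a_5p' 'ACGT' '>a_3p*' 'TTGA' '>b_3p' 'GGCC' > "$tmp"

# paste - - joins each header with its sequence on one tab-separated line,
# grep -v drops the joined lines containing "*", and tr splits them again.
filtered=$(paste - - < "$tmp" | grep -v '\*' | tr '\t' '\n')

printf '%s\n' "$filtered"
rm -f "$tmp"
```

Since -v is applied to whole records here, no context option is needed and no "--" separators are produced.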


Nope :-( although the output has no "--" between the groups.

7.4 years ago

I solved it in two steps:

grep --no-group-separator -e ">*\*" -v FILEA | grep ">" > IDS_forgrep.txt
grep --no-group-separator -f IDS_forgrep.txt FILEA -A 1

and in one step:

grep --no-group-separator -e "p$" FILEA -A 1