Question

Two multifasta files with the same sequences, but different headers. How to change headers to match?

0

Entering edit mode

7.5 years ago

roblogan6 ▴ 30

I have a large multifasta file (about 125,000 sequences) and a smaller multifasta file (about 100 sequences). All sequences in the smaller multifasta file are found in the larger file, but the headers are different. I have many (thousands) of such smaller multifasta files. How can I search the larger file for the sequences found in the smaller and then exchange the header? I would ideally be able to print out a smaller multifasta file that would be identical to the one I started with, just with the headers found in the larger file. All sequences in both files have been linearized- that is, they are a single line. Thanks!

perl multifasta fasta • 2.6k views

ADD COMMENT • link updated 7.5 years ago by shenwei356 8.4k • written 7.5 years ago by roblogan6 ▴ 30

1

Entering edit mode

7.5 years ago

shenwei356 8.4k

Solution of grep -B1 by @roblogan6 is very cool, however as he/she said, it matched substrings instead of whole sequences. Besides, sequences both in the big and small files must be in single-line format.

Here's is a robust preciser solution with SeqKit:

Big file:

$ cat big.f
>seq1
ACTACGACGTC
TAGCGTA
>seq2
CGACGATCTAC
GTAGCTAGAT
>seq3
ACGTCTGACGT
>seq4 containing seq3
ACGTACGTCTG
ACGTCC

Small file:

$ cat small.fa
>seq_abc
ACTACGACGTC
TAGCGTA
>seq_123
ACGTCTGACGT

Precisely matching by sequences:

$ seqkit grep -s -i -f <(seqkit seq -s -w 0 small.fa) -w 70 big.fa    
>seq1
ACTACGACGTCTAGCGTA
>seq3
ACGTCTGACGT

Here's the "long-option" version:

seqkit grep --by-seq --ignore-case --pattern-file <(seqkit seq --seq --line-width 0 small.fa) --line-width 70 big.fa

ADD COMMENT • link 7.5 years ago by shenwei356 8.4k

score 2 · Accepted Answer · 2016-10-28

2

Entering edit mode

7.5 years ago

Eric Lim ★ 2.1k

The following might work for you.

grep -f other.fa reference.fa -B1 --no-group-separator

or

grep -f other.fa reference.fa -B1 | grep -v -- "^--$"

if --no-group-separator is not available in your version of grep.

Note that this will match substrings, which could be unwanted or undesired depending on your use case.

ADD COMMENT • link 7.5 years ago by Eric Lim ★ 2.1k

0

Entering edit mode

Thanks! Is there any way to change the grep command to only print out the headers themselves, instead of also including the associated sequences? I am not very familiar with grep. Thanks again.

ADD REPLY • link 7.5 years ago by roblogan6 ▴ 30

1

Entering edit mode

You could use grep "^>" , which would get you just the headers.

Looks like the original grep solution worked? I am curious since you had large files.

ADD REPLY • link 7.5 years ago by GenoMax 141k

0

Entering edit mode

I couldn't do it locally because grep runs out of memory, but I can run it successfully over the server that I have access to. I need to learn more about grep, awk, sed, etc. They seem quick, powerful, and really simple. Maybe I am deceiving myself about the simple part though! Do you have any resources to suggest for learning more about these type of commands? Thanks again.

ADD REPLY • link 7.4 years ago by roblogan6 ▴ 30