Useful Bash Commands To Handle Fasta Files
12
49
Entering edit mode
12.2 years ago
Anima Mundi ★ 2.9k

Hello,

Starting from this question, I realized that the proper usage of bash commands to handle FASTA files* could be, for those (like me) not proficient with the usage of the terminal, a difficult task. Also, I feel it is important to learn how to use them correctly. Could you point me out what are, in your personal experience, the most important commands useful in FASTA lists manipulation? Possibly, I would prioritize commands wich are easy to use and, possibly, versatile.

If you want, I agree to treat the topic as a community wiki.

*Extract IDs, remove certain sequences, edit descriptions, listing only the sequences that start with A and similar.

command-line fasta bash • 137k views
ADD COMMENT
4
Entering edit mode

cat, grep, sed, cut, sort, uniq, tr, awk, paste, Redirections (<, >, >>, ...)

ADD REPLY
1
Entering edit mode

You can certainly do a lot of FASTA file processing (and any other text file) in the shell. However, I'd recommend learning at least one of the Bio* libraries (e.g. Bioperl's Seq::IO) for more versatile solutions.

ADD REPLY
1
Entering edit mode

I like very much to have a repository of common and useful bash scripts/commands. I suggest to extend this repository to more topics like VCF, BED manipulations, etc.

ADD REPLY
0
Entering edit mode

Thanks!

ADD REPLY
44
Entering edit mode
12.2 years ago

My top used bash commands for fasta files:

(1) counting number of sequences in a fasta file:

grep -c "^>" file.fa

(2) add something to end of all header lines:

sed 's/>.*/&WHATEVERYOUWANT/' file.fa > outfile.fa

(3) clean up a fasta file so only first column of the header is outputted:

awk '{print $1}' file.fa > output.fa

Here are some other cool ones: http://chrisduran.eu/bioinformatics/linux-and-osx-commands-for-working-with-fasta-files/

ADD COMMENT
5
Entering edit mode

First example probably my all-time favourite.

ADD REPLY
0
Entering edit mode
wc -l file.fa

or

wc file.fa

in the first column, you get the line count. divide by 2 if fasta, 4 if fastq.

This probably way faster than the grep approach.

ADD REPLY
5
Entering edit mode

This is good but wont work for the odd fasta file where the sequence is wrapped onto a newline after a certain number of characters as some programs like to return.

ADD REPLY
1
Entering edit mode

This also is interesting: what if I would like to replace the "WHATEVERYOUWANT" with a variable with a different value for each sequence?

ADD REPLY
0
Entering edit mode

do you know any tools for counting the actual sequence , ie the characters in a single sequence , a cds file, a protein string

ADD REPLY
0
Entering edit mode

yes, seqmagick, see my answer below

ADD REPLY
24
Entering edit mode
12.2 years ago

You will probably get a lot of different answers because there are many ways to parse fasta files with Bash and tools like grep, awk and sed. Here are some suggestions.

To extract ids, just use the following:

grep -o -E "^>\w+" file.fasta | tr -d ">"

A useful step is to linearize your sequences (i.e. remove the sequence wrapping). This is not a perfect solution, as I suspect that a few steps could be avoided, but it works quite fast, even for thousands of sequences.

sed -e 's/\(^>.*$\)/#\1#/' file.fasta | tr -d "\r" | tr -d "\n" | sed -e 's/$/#/' | tr "#" "\n" | sed -e '/^$/d'

Remove duplicated sequences. Pierre Lindenbaum proposed this solution.

sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -t $'\t' -f -k 2,2  | sed -e 's/^/>/' -e 's/\t/\n/'

Once you've done that, it's easier to do things such as "list only the sequences that start with A", with grep for instance.

ADD COMMENT
0
Entering edit mode

Interesting, also this should work to list only the sequences that start with A: awk 'sub(/^A/, ">1\nA")'

ADD REPLY
0
Entering edit mode

Also useful after linearizing your sequences and losing the original file is the fold command to re-wrap a text file.

ADD REPLY
4
Entering edit mode
6.5 years ago

Bash one liner to get the A T (or U) G C counts for all the sequences in a multi fasta file

echo -e "seq_id\tA\tU\tG\tC"; while read line; do echo $line | grep ">" | sed 's/>//g'; for i in A U G C;do echo $line | grep -v ">" | grep -o $i | wc -l | grep -v "^0"; done; done < your_fasta_file.fa | paste - - - - -

Example

$echo -e "seq_id\tA\tT\tG\tC"; while read line; do echo $line | grep ">" | sed 's/>//g'; for i in A T G C;do echo $line | grep -v ">" | grep -o $i | wc -l | grep -v "^0"; done; done < adapter.fasta | paste - - - - -

Output

seq_id  A   T   G   C
adapter1    7   8   7   8
adapter2    4   3   3   2
ADD COMMENT
0
Entering edit mode

This answer will show wrong AUGC count number when one of your sequence ( in your multiple sequence FASTA file) is not with all four characters, AUGC.

For example,

> seq_1
AUGC
> seq_2
AU
> seq_3
G

Then the command line may not be working.

ADD REPLY
3
Entering edit mode
12.2 years ago

Have a look at Biopieces www.biopieces.org).

ADD COMMENT
3
Entering edit mode
7.3 years ago
Chun-Jie Liu ▴ 280

I write some general scripts for handle NGS data. I hope it can help you. https://github.com/chunjie-sam-liu/useful-scripts

ADD COMMENT
2
Entering edit mode
7.7 years ago
nim.1111.ou ▴ 180

ive been working on bioinformatics for over three years, here are lots of frequently used bash commands https://github.com/onceupon/Bash-Oneliner/blob/master/README.md

ADD COMMENT
2
Entering edit mode
7.3 years ago

Linearizing the complete fasta file can be done using,

while read line;do if [ "${line:0:1}" == ">" ]; then echo -e "\n"$line; else echo $line | tr -d '\n' ; fi; done < input.fasta > output.fasta

Once linearized, say you want pick the sequence for the id Q15049 you can use

grep -A1 'Q15049' output.fasta

This will give you the header and sequence of the id.

Hope its useful

ADD COMMENT
0
Entering edit mode

just two comments.

First, you can remove the initial empty lines piping sed '/^--$/d' so the whole code ends up like

while read line; do
  if [ "${line:0:1}" == ">" ]; then 
    echo -e "\n"$line;
  else 
    echo $line | tr -d '\n' ; 
  fi; 
done < input.fasta | sed '/^--$/d' > output.fasta

And second, the previous code works great but the following awk version is much faster

awk '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}}' input.fasta > output.fasta
ADD REPLY
1
Entering edit mode
ADD COMMENT
1
Entering edit mode
6.5 years ago
Joe 21k

Couple I stick in my .bashrc for quick and dirty work (they all print to screen so they can be piped):


Return the lengths of all the sequences in a multifasta

falens(){
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' $1
}

*Disclaimer - I think I stole this off the internet somewhere.


Remove duplicated fastas in a multifasta:

dedupe(){
cat $1 | awk '!_[$0]++'
}

Merge a multifasta into a single fasta sequence (remove all but the first header, and deal with newlines):

fastcat(){
cat $1 | sed -e '1!{/^>.*/d;}' | sed  ':a;N;$!ba;s/\n//2g'
}

Split a multifasta in to separate files (with arbitrary names). This was my pure bash answer to a biostars post some time ago (doesn't print to screen, just makes files in current dir).

splitfa(){
i=1;
while read line ; do
  if [ ${line:0:1} == ">" ] ; then
    header="$line"
    echo "$header" >> seq"${i}".fasta
  else
    seq="$line"
    echo "$seq" >> seq"${i}".fasta
    ((i++))
  fi
done < $1
}
ADD COMMENT
0
Entering edit mode
  1. Save all these quick scripts in a file under, say, ~/.myShellScripts/fasta_utils.sh
  2. Add source ~/.myShellScripts/fasta_utils.sh to ~/.${SHELL}rc
  3. Better modular maintenance :-)

I also save everything in ~/.zshrc, but I love how oh-my-zsh does stuff. They complicate it too much though and I have to unset their stuff before I set mine.

ADD REPLY
0
Entering edit mode

That's quite a nice way to do it, in actual fact I lied slightly, and these functions are in a file called .bash_functions which .bashrc then sources directly.

ADD REPLY
0
Entering edit mode

Joe you just add the dir path to .bashrc or you did something else? Thanks

ADD REPLY
1
Entering edit mode

Yeah, it uses Ram's approach and sources the file from within bashrc essentially

ADD REPLY
0
Entering edit mode
2.9 years ago
jena ▴ 290

I think a lot of folks here would benefit from knowing about the seqmagick tool.

Seqmagick is a kickass little utility built in the spirit of imagemagick to expose the file format conversion in Biopython in a convenient way. Instead of having a big mess of scripts, there is one that takes arguments

It can be easily installed with pip or conda (using the bioconda channel), even without root permissions.

ADD COMMENT
0
Entering edit mode
8 months ago

Add dummy sequence IDs for a set of sequences:

awk '{print ">Seq_"NR"\n"$1}' sequences.txt > sequences.fasta

A simple way to add dummy sequence IDs if you only have the sequences themselves (1 per line) and are working with a tool that requires data in FASTA format.

ADD COMMENT

Login before adding your answer.

Traffic: 1968 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6