Biostar Beta. Not for public use.
Useful Bash Commands To Handle Fasta Files
30
Entering edit mode
3.3 years ago
Anima Mundi ♦ 2.4k
@Anima Mundi1428

Hello,

starting from this question, I realized that the proper usage of bash commands to handle FASTA files* could be, for those (like me) not proficient with the usage of the terminal, a difficult task. Also, I feel it is important to learn how to use them correctly. Could you point me out what are, in your personal experience, the most important commands useful in FASTA lists manipulation? Possibly, I would prioritize commands wich are easy to use and, possibly, versatile.

If you want, I agree to treat the topic as a community wiki.

*Extract IDs, remove certain sequences, edit descriptions, listing only the sequences that start with A and similar.

command-line command-line fasta list bash • 75k views
ADD COMMENTlink
4
Entering edit mode

cat, grep, sed, cut, sort, uniq, tr, awk, paste, Redirections (<, >, >>, ...)

ADD REPLYlink
1
Entering edit mode

You can certainly do a lot of FASTA file processing (and any other text file) in the shell. However, I'd recommend learning at least one of the Bio* libraries (e.g. Bioperl's Seq::IO) for more versatile solutions.

ADD REPLYlink
1
Entering edit mode

I like very much to have a repository of common and useful bash scripts/commands. I suggest to extend this repository to more topics like VCF, BED manipulations, etc.

ADD REPLYlink
0
Entering edit mode

Thanks!

ADD REPLYlink
36
Entering edit mode
2.4 years ago
@Damian Kao3226

My top used bash commands for fasta files:

(1) counting number of sequences in a fasta file:

grep -c "^>" file.fa

(2) add something to end of all header lines:

sed 's/>.*/&WHATEVERYOUWANT/' file.fa > outfile.fa

(3) clean up a fasta file so only first column of the header is outputted:

awk '{print $1}' file.fa > output.fa

Here are some other cool ones: http://chrisduran.eu/bioinformatics/linux-and-osx-commands-for-working-with-fasta-files/

ADD COMMENTlink
5
Entering edit mode

First example probably my all-time favourite.

ADD REPLYlink
0
Entering edit mode
wc -l file.fa

or

wc file.fa

in the first column, you get the line count. divide by 2 if fasta, 4 if fastq.

This probably way faster than the grep approach.

ADD REPLYlink
1
Entering edit mode

This is good but wont work for the odd fasta file where the sequence is wrapped onto a newline after a certain number of characters as some programs like to return.

ADD REPLYlink
1
Entering edit mode

This also is interesting: what if I would like to replace the "WHATEVERYOUWANT" with a variable with a different value for each sequence?

ADD REPLYlink
0
Entering edit mode

do you know any tools for counting the actual sequence , ie the characters in a single sequence , a cds file, a protein string

ADD REPLYlink
15
Entering edit mode
2.4 years ago
@Frédéric Mahé224

You will probably get a lot of different answers because there are many ways to parse fasta files with Bash and tools like grep, awk and sed. Here are some suggestions.

To extract ids, just use the following:

grep -o -E "^>\w+" file.fasta | tr -d ">"

A useful step is to linearize your sequences (i.e. remove the sequence wrapping). This is not a perfect solution, as I suspect that a few steps could be avoided, but it works quite fast, even for thousands of sequences.

sed -e 's/\(^>.*$\)/#\1#/' file.fasta | tr -d "\r" | tr -d "\n" | sed -e 's/$/#/' | tr "#" "\n" | sed -e '/^$/d'

Remove duplicated sequences. Pierre Lindenbaum proposed this solution.

sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -t $'\t' -f -k 2,2  | sed -e 's/^/>/' -e 's/\t/\n/'

Once you've done that, it's easier to do things such as "list only the sequences that start with A", with grep for instance.

ADD COMMENTlink
0
Entering edit mode

Interesting, also this should work to list only the sequences that start with A: awk 'sub(/^A/, ">1\nA")'

ADD REPLYlink
0
Entering edit mode

Also useful after linearizing your sequences and losing the original file is the fold command to re-wrap a text file.

ADD REPLYlink
3
Entering edit mode
8.9 years ago
@Martin A Hansen4395

Have a look at Biopieces (www.biopieces.org).

ADD COMMENTlink
3
Entering edit mode
3.3 years ago
@Vijay Lakhujani26377

Bash one liner to get the A T (or U) G C counts for all the sequences in a multi fasta file

echo -e "seq_id\tA\tU\tG\tC"; while read line; do echo $line | grep ">" | sed 's/>//g'; for i in A U G C;do echo $line | grep -v ">" | grep -o $i | wc -l | grep -v "^0"; done; done < your_fasta_file.fa | paste - - - - -

Example

$echo -e "seq_id\tA\tT\tG\tC"; while read line; do echo $line | grep ">" | sed 's/>//g'; for i in A T G C;do echo $line | grep -v ">" | grep -o $i | wc -l | grep -v "^0"; done; done < adapter.fasta | paste - - - - -

Output

seq_id  A   T   G   C
adapter1    7   8   7   8
adapter2    4   3   3   2
ADD COMMENTlink
2
Entering edit mode
4.1 years ago
Chun-Jie Liu • 260
@Chun-Jie Liu33591

I write some general scripts for handle NGS data. I hope it can help you. https://github.com/chunjie-sam-liu/useful-scripts

ADD COMMENTlink
1
Entering edit mode
4.0 years ago
nim.1111.ou • 70
@nim.1111.ou18790

ive been working on bioinformatics for over three years, here are lots of frequently used bash commands https://github.com/onceupon/Bash-Oneliner/blob/master/README.md

ADD COMMENTlink
1
Entering edit mode
2.4 years ago
@vivekananthrp31669

Linearizing the complete fasta file can be done using,

while read line;do if [ "${line:0:1}" == ">" ]; then echo -e "\n"$line; else echo $line | tr -d '\n' ; fi; done < input.fasta > output.fasta

Once linearized, say you want pick the sequence for the id Q15049 you can use

grep -A1 'Q15049' output.fasta

This will give you the header and sequence of the id.

Hope its useful

ADD COMMENTlink
1
Entering edit mode
3.3 years ago
Joe 12k
@Joe

Couple I stick in my .bashrc for quick and dirty work (they all print to screen so they can be piped):


Return the lengths of all the sequences in a multifasta

falens(){
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' $1
}

*Disclaimer - I think I stole this off the internet somewhere.


Remove duplicated fastas in a multifasta:

dedupe(){
cat $1 | awk '!_[$0]++'
}

Merge a multifasta into a single fasta sequence (remove all but the first header, and deal with newlines):

fastcat(){
cat $1 | sed -e '1!{/^>.*/d;}' | sed  ':a;N;$!ba;s/\n//2g'
}

Split a multifasta in to separate files (with arbitrary names). This was my pure bash answer to a biostars post some time ago (doesn't print to screen, just makes files in current dir).

splitfa(){
i=1;
while read line ; do
  if [ ${line:0:1} == ">" ] ; then
    header="$line"
    echo "$header" >> seq"${i}".fasta
  else
    seq="$line"
    echo "$seq" >> seq"${i}".fasta
    ((i++))
  fi
done < $1
}
ADD COMMENTlink
0
Entering edit mode
  1. Save all these quick scripts in a file under, say, ~/.myShellScripts/fasta_utils.sh
  2. Add source ~/.myShellScripts/fasta_utils.sh to ~/.${SHELL}rc
  3. Better modular maintenance :-)

I also save everything in ~/.zshrc, but I love how oh-my-zsh does stuff. They complicate it too much though and I have to unset their stuff before I set mine.

ADD REPLYlink
0
Entering edit mode

That's quite a nice way to do it, in actual fact I lied slightly, and these functions are in a file called .bash_functions which .bashrc then sources directly.

ADD REPLYlink
0
Entering edit mode
4.1 years ago
@shenwei3564664

Tool: FASTA and FASTQ tools

ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.3