Tutorial: Split a 'linearised' (flattened) FASTA sequence into multi-line using AWK
3.1 years ago

We have a FASTA file in which each record is a header line followed by the entire sequence on a single line:

cat fasta.fasta
> 1
AGTACGATCTACGTACGCAACTGAGCTACTACAGTCATGCTGACACTGACTGACACTGACTGACTGTGACACTGACTGCATGCTGCTGGCCCCGCAGTATCGACTGCGTACGTCGCGCGATTACGCGTACTGCGTCTGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGTGCACGTACTGATGCACATGCACTGA
> 2
TGACAGCTACTGACGTACGTACGTACGTCAGTACGTACGTACGTCAGTACGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAGCACTGCATGACTGACGTACGTACGTACGTACGT

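We can use AWK to wrap each sequence into lines of equal length (the last line of a record may be shorter). The empty field separator (-F "") makes each character its own field in gawk; POSIX leaves that behaviour unspecified, so other AWK implementations may differ: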

awk -v len=40 -F "" '/^>/ {print}; !/^>/ {for (i=1; i<=NF; i++) {printf "%s", $(i); if (i % len == 0 || i == NF) printf "\n"}}' fasta.fasta
> 1
AGTACGATCTACGTACGCAACTGAGCTACTACAGTCATGC
TGACACTGACTGACACTGACTGACTGTGACACTGACTGCA
TGCTGCTGGCCCCGCAGTATCGACTGCGTACGTCGCGCGA
TTACGCGTACTGCGTCTGCATGCATGCATGCATGCATGCA
TGCATGCATGCATGCATGTGCACGTACTGATGCACATGCA
CTGA
> 2
TGACAGCTACTGACGTACGTACGTACGTCAGTACGTACGT
ACGTCAGTACGTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTAGCACTGCATGACTGACGTACGT
ACGTACGTACGT

Changing len changes the line width, for example len=10:

awk -v len=10 -F "" '/^>/ {print}; !/^>/ {for (i=1; i<=NF; i++) {printf "%s", $(i); if (i % len == 0 || i == NF) printf "\n"}}' fasta.fasta
> 1
AGTACGATCT
ACGTACGCAA
CTGAGCTACT
ACAGTCATGC
TGACACTGAC
TGACACTGAC
TGACTGTGAC
ACTGACTGCA
TGCTGCTGGC
CCCGCAGTAT
CGACTGCGTA
CGTCGCGCGA
TTACGCGTAC
TGCGTCTGCA
TGCATGCATG
CATGCATGCA
TGCATGCATG
CATGCATGTG
CACGTACTGA
TGCACATGCA
CTGA
> 2
TGACAGCTAC
TGACGTACGT
ACGTACGTCA
GTACGTACGT
ACGTCAGTAC
GTTTTTTTTT
TTTTTTTTTT
TTTTTTTTTT
TTTTTTTTTT
TTTTTTTAGC
ACTGCATGAC
TGACGTACGT
ACGTACGTAC
GT

AWK doesn't have to be a one-liner, either:

awk -v len=80 -F "" '/^>/ {print};
  !/^>/ {
    for (i=1; i<=NF; i++) {
      printf "%s", $(i);
      if (i % len == 0 || i == NF)
        printf "\n"
    }
  }' fasta.fasta

> 1
AGTACGATCTACGTACGCAACTGAGCTACTACAGTCATGCTGACACTGACTGACACTGACTGACTGTGACACTGACTGCA
TGCTGCTGGCCCCGCAGTATCGACTGCGTACGTCGCGCGATTACGCGTACTGCGTCTGCATGCATGCATGCATGCATGCA
TGCATGCATGCATGCATGTGCACGTACTGATGCACATGCACTGA
> 2
TGACAGCTACTGACGTACGTACGTACGTCAGTACGTACGTACGTCAGTACGTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTAGCACTGCATGACTGACGTACGTACGTACGTACGT
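
The empty field separator is not strictly needed. Below is a minimal sketch of the same wrapping done with substr(), which should behave the same way in any POSIX AWK. The function name fold_fasta and its default width of 60 are illustrative assumptions, not part of the commands above:

```shell
# Hypothetical helper: wrap each sequence line of a linearised FASTA file.
# Usage: fold_fasta FILE [WIDTH]   (WIDTH defaults to 60, an arbitrary choice)
fold_fasta() {
  awk -v len="${2:-60}" '
    /^>/ { print; next }   # header lines pass through unchanged
    {
      # emit the sequence in chunks of at most len characters
      for (i = 1; i <= length($0); i += len)
        print substr($0, i, len)
    }
  ' "$1"
}
```

For example, fold_fasta fasta.fasta 40 should give the same lines as the len=40 command. As with the field-based version, empty input lines produce no output.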

Kevin


To linearize a multi-line FASTA file first, use @Pierre's code; you can then apply @Kevin's code to the result.
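
The code referred to is not shown in this thread, so the following is only a rough sketch of one common way to linearize a FASTA file in AWK, not necessarily the approach mentioned above. The function name linearize_fasta is an assumption made for illustration:

```shell
# Hypothetical helper: join the sequence lines of each FASTA record into one line.
# Usage: linearize_fasta FILE
linearize_fasta() {
  awk '
    /^>/ {
      if (seq != "") print seq   # flush the sequence of the previous record
      print                      # print the header line as-is
      seq = ""
      next
    }
    { seq = seq $0 }             # append each sequence line to the buffer
    END { if (seq != "") print seq }
  ' "$1"
}
```

The output has one header line and one sequence line per record, which is the input format the wrapping commands in the post expect.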
