Question: How to extract upper/lowercase from fasta to output position and +/-

I'm looking for help extracting uppercase/lowercase information from a multi-FASTA file. I have been using the 'SNPable regions' method to mask a non-reference genome. Hence, I currently have my information in the following format:


I want to compare this to another masking approach and need to reformat it. For a simple comparison, I'd like it formatted as contig / position / + for uppercase or - for lowercase, i.e.:

Contig1 1 +
Contig1 2 +
Contig1 3 +
Contig1 4 -
Contig1 5 -
Contig2 1 -
Contig2 2 -
Contig2 3 +
Contig2 4 +
Contig2 5 -

Any simple fix?


A quick and simple solution would be to use awk on the FASTA file: split each sequence line into an array of characters and check with a regex whether each character is lowercase:

awk '{if(NR%2==1){id=substr($1,2)};if(NR%2==0){n=split($0,a,""); for(i=1;i<=n;i++){if(match(a[i],/[a-z]/)){x="-"}else{x="+"}; printf "%s\t%d\t%s\n" , id,i,x}}}' test.fa

This will result in:

Contig1 1       +
Contig1 2       +
Contig1 3       +
Contig1 4       -
Contig1 5       -
Contig1 16      -
Contig1 17      -
Contig1 18      -
Contig2 1       -
Contig2 2       -
Contig2 3       +
Contig2 4       +
Contig2 5       -
Contig2 6       -

Note that this will only work if your FASTA file is structured like the one in your example, i.e. each header line is followed by exactly one sequence line.
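For files where the sequence is wrapped over several lines, one possible variant is sketched below. It is not part of the original answer: it keeps a running position counter per record instead of relying on odd/even line numbers, and it treats a character as lowercase when it differs from its uppercase form, which avoids locale-dependent `[a-z]` ranges. The function name `fasta_case_positions` is made up for illustration, and the sketch assumes every record starts with a `>` header line.

```shell
# Hypothetical helper: prints "contig <TAB> position <TAB> +/-" for every base.
# Reads FASTA from the files given as arguments, or from stdin if none are given.
fasta_case_positions() {
  awk '
    /^>/ {
      id = substr($1, 2)   # contig name: first field without the leading ">"
      pos = 0              # restart the position counter for each record
      next
    }
    {
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        pos++
        # a character that changes under toupper() is lowercase, i.e. masked
        printf "%s\t%d\t%s\n", id, pos, (c != toupper(c) ? "-" : "+")
      }
    }
  ' "$@"
}
```

Assuming the masked genome is in a file such as `test.fa`, it could be called as `fasta_case_positions test.fa > positions.tsv`. Because the counter continues across wrapped lines, no prior linearization should be needed.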

[Edit:] The script had an off-by-one error (i=0 instead of i=1).

michael.ante, 3.8 years ago

Thanks a lot Michael, very elegant solution :)

Ahh, you are correct about the structure of the FASTA file. I oversimplified it a bit too much. In truth it is a multi-FASTA file with a line break every 60 nt, so your quick fix only works for the first sequence line following >Contig. It is a very standard file like so:


Perhaps a quick fix to your solution could overcome this? Otherwise, should I remove all line breaks within the sequences and rerun it?

MJS, 3.8 years ago

Search for "linearize fasta" and pipe the output of that to michael.ante's solution.
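As a rough illustration of what such a linearizing step might look like, here is a minimal awk sketch. The function name `linearize_fasta` is made up, and the sketch assumes the input starts with a `>` header line and contains no blank lines before it.

```shell
# Hypothetical helper: joins wrapped sequence lines so that each record becomes
# exactly two lines (header, then the full sequence).
linearize_fasta() {
  awk '
    /^>/ {
      if (NR > 1) printf "\n"   # terminate the sequence line of the previous record
      print                     # header line, unchanged
      next
    }
    { printf "%s", $0 }         # append sequence chunks without a line break
    END { if (NR > 0) printf "\n" }
  ' "$@"
}
```

The output alternates header and sequence lines, which is the layout the one-liner above expects, so `linearize_fasta test.fa` could be piped into that awk command with the trailing `test.fa` argument dropped so that it reads from stdin.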

3.8 years ago

Correct, I'd go for the FASTA formatter from the FASTX-Toolkit.

3.8 years ago

Thank you for the help.

3.8 years ago
