Biostar Beta. Not for public use.
How to extract upper/lowercase from fasta to output position and +/-
4.8 years ago
MJS • 0
@MJS25617

I'm looking for help extracting uppercase/lowercase information from a multi-FASTA file. I have been using the 'SNPable regions' method (http://lh3lh3.users.sourceforge.net/snpable.shtml) to mask a non-reference genome, so my data currently look like this:

>Contig1
TCCcaccgcaacgaactg
>Contig2
ggTCgtgaagtaaaaaaa
>etc...

I want to compare this to another masking approach and need to reformat it. For a simple comparison, I'd like it formatted as contig / position / + for uppercase or - for lowercase, i.e.:

Contig1 1 +
Contig1 2 +
Contig1 3 +
Contig1 4 -
Contig1 5 -
.......
.......
Contig2 1 -
Contig2 2 -
Contig2 3 +
Contig2 4 +
Contig2 5 -
....
....

Any simple fix?

uppercase lowercase fasta extract • 1.3k views
4.8 years ago
michael.ante ♦ 3.3k
@michael.ante14739

A quick and simple solution would be to use awk on the FASTA file: split each sequence line into an array of characters and check with a regex whether each character is lowercase or not:

awk '{if(NR%2==1){id=substr($1,2)};if(NR%2==0){n=split($0,a,""); for(i=1;i<=n;i++){if(match(a[i],/[a-z]/)){x="-"}else{x="+"}; printf "%s\t%d\t%s\n" , id,i,x}}}' test.fa

This will result in:

Contig1 1       +
Contig1 2       +
Contig1 3       +
Contig1 4       -
Contig1 5       -
...
Contig1 16      -
Contig1 17      -
Contig1 18      -
Contig2 1       -
Contig2 2       -
Contig2 3       +
Contig2 4       +
Contig2 5       -
Contig2 6       -
...

Note that this will only work if your FASTA file is structured like the one in your example, i.e. each header line is followed by exactly one sequence line.

[Edit:] The script had an off-by-one error (i=0 instead of i=1).
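
For FASTA files whose sequences are wrapped over several lines, one option is to keep a running position counter per record instead of relying on odd/even line numbers. The following is a minimal sketch under that assumption and is not part of the original one-liner. The function name fasta_case_table is made up for illustration. It uses substr() rather than split() with an empty separator (whose behaviour POSIX leaves unspecified), and it treats every character that is not a lowercase letter as "+", as the one-liner does:

```shell
# Hypothetical helper: print "id<TAB>position<TAB>+/-" for every base,
# for FASTA records with any number of sequence lines.
fasta_case_table() {
  awk '
    /^>/ {
      id = substr($1, 2)   # first word of the header, without ">"
      pos = 0              # restart the position counter for each record
      next
    }
    {
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        pos++
        # a lowercase letter changes under toupper(); anything else gets "+"
        sign = (c != toupper(c)) ? "-" : "+"
        printf "%s\t%d\t%s\n", id, pos, sign
      }
    }
  ' "$@"
}
```

It reads the files given as arguments, or standard input if none are given, so it can be called as fasta_case_table test.fa.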

4.8 years ago
MJS • 0
@MJS25617

Thanks a lot Michael, very elegant solution :)

Ahh, you are right about the structure of the FASTA file; I oversimplified it. It is actually a multi-FASTA file with a line break every 60 nt, so your quick fix only works for the first sequence line following each >Contig header. It is a very standard file, like so:

>Contig1
ggaccagagaggttctcaccttcagtgcggcgatgaagttgtgTCCCGtCatcccgtcac
cagacatccacccgttgctgcgccaatcacagaactccttaaagccagttccctgatatg
acgccaaaaacttggcttctcgggctgctgcccgcgcctttcttgaagcgttcaacccgg
>Contig2
aacgcgtgttcgttgctgctgttgggttatgcaGTTTTGACCGTGGCGCAAAATACAAGA
AGCATAGCGCAAAGTGACGTTATTTAGCGATCAGTGAACACGCGAGCATTGACTAACGGA
AAAAGGGAAAAAGCATACGTACTGCTAACGCAGGCGCTCAGCCTGACGAAGGCGACGCGT
>etc
...

Perhaps a quick fix to your solution could overcome this? Otherwise, should I remove all line breaks within the sequences and rerun it?

Search for "linearize fasta" and pipe the output of that to michael.ante's solution.
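
As an illustration of that suggestion, a linearizing step could be sketched in awk as below. The function name linearize_fasta is hypothetical, and the sketch assumes plain FASTA in which every header line starts with ">":

```shell
# Hypothetical helper: join wrapped sequence lines so that each record
# becomes exactly one header line followed by one sequence line.
linearize_fasta() {
  awk '
    /^>/ {
      if (seen) printf "\n"   # close the previous sequence line
      seen = 1
      print                   # header line, unchanged
      next
    }
    { printf "%s", $0 }       # append sequence text without a line break
    END { if (seen) printf "\n" }
  ' "$@"
}
```

The output has the two-lines-per-record layout that the awk one-liner above expects, so it can be piped into that command in place of the file argument.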

Correct, I'd go for the FASTA formatter (fasta_formatter) from the FASTX-Toolkit.

Thank you for the help
