Suggestion on parsing multipe alignment in fasta format
0
0
Entering edit mode
2.9 years ago
QLFblaireau ▴ 30

Hello, I would like some suggestions for the following problems. I have many multiple alignments from mafft, in fasta format. Eahc of these alignment has a specific sequence for which I am interested to know 1 thing: 1) is there at least a gapless alignment block length of at least 180 base pairs?

Here is a sample, let's say I am interested to query the sequence "Mitra_TRINITY_DN40971_c28_g1_i1" and check for the length of alignment block in hope to find one of at least 180 bp (I don't know if the example I give here satisfies this condition it's just to display the file format).

Any existing parser or quick awk hack (in fact I feel there should be a rather easy awk hack but can't figure it out)? Thank you for any guidance :)

>Mitra_TRINITY_DN40971_c28_g1_i1
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------t--------------------------------------ctc
aattattactgctt------tccaaaatcatgctgcac------------atcaattaat
gaggtgat-------ctgtcatggaattcctgacatgagaccattagaa--agtggtgac
attatcaacattgacat-----cactgcctaccatcggaattttcatggagacttgaacg
agaccatgtttgtcggggaggtggatgaggccagcaaagagctagtgaagacgacgcatg
aatgtctgatgatggccatcgatgaagtgaaaccaggggtg-------------------
------------------------------------------------------------
------cggtacagagagatgggaaacatcatccagaaa---cacgcccagtctcatggc
tactctgttgtccgcagctactgcggtcatgggattcaccaattgttccacactgctcca
agtgtgcctcattatgggaagaataaa------------gccatcggtgtgatgaaacca
ggtcatactttcaccattgagccaatgatatctcaaggcacctggcgtgatgagatgtgg
ccagacaactggactgcagtgacacaggacggcaaacgctcagcgcagtttgagcacaca
ctgctggtgact------------------------------------------------
-------------------------gacacaggatgcgaagtcttgacccgtcgtcgggc
aaatgatggacagccccactttatggactccaaatga------------------tggac
agcca-------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
----------------------------------------------------
>Mitra_TRINITY_DN40971_c28_g2_i1
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---------------------------------------------------------gtg
ctcttccgatctgccaaagaaacac-----------------------------------
---------------------aggaggagagcaagaatggttacaacccattccccggct
ttcgttacacgggatcattgagagcattccctgtgacaccaaagagag------------
-agatgcctgacacaatccctcgtcctgattatgctgaccattcagaaggtatccccctg
tcagagcgccagatgagaggttcgacaaacatcaaacagctggacgatgaagagcaggaa
ggcatgcgagttgtgtgcaagttggctcgggaaatcctggatgaagcagccagtgctgtt
ggcattggggtaactacagatgaaattgacagactggtccatgagg-caacaatagacag
agagtgctatccatctcct--------------------------------------ctg
aactactactgctt------tcccaaatcatgctgcac------------gtctatcaat
gaagtgat-------ctgtcatggaattcctgacacgaggccattggaa--aatggtgac
attgtcaacattgacat-----tactgcgtaccatcggagctttcatggggacttaaacg
agaccatgtttgtcggagaggtagacgaggccagcaaggagctggtgaagaccacacacg
aatgcctcatgatggccattgaagaagtgaaacccggtgtg-------------------
------------------------------------------------------------
------cggtacagagagatgggtaacatcatccagaag---catgcccaggcccatgga
tactctgttgttcgcagctactgtggccatggt--------------------------
mafft multiple msa sequence clustal • 529 views
ADD COMMENT
0
Entering edit mode

see if this helps:

$ seqkit grep -sdip "n{180}" test.fa

>Mitra_TRINITY_DN40971_c28_g2_i1
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---------------------------------------------------------gtg
ctcttccgatctgccaaagaaacac-----------------------------------
---------------------aggaggagagcaagaatggttacaacccattccccggct
ttcgttacacgggatcattgagagcattccctgtgacaccaaagagag------------
-agatgcctgacacaatccctcgtcctgattatgctgaccattcagaaggtatccccctg
tcagagcgccagatgagaggttcgacaaacatcaaacagctggacgatgaagagcaggaa
ggcatgcgagttgtgtgcaagttggctcgggaaatcctggatgaagcagccagtgctgtt
ggcattggggtaactacagatgaaattgacagactggtccatgagg-caacaatagacag
agagtgctatccatctcct--------------------------------------ctg
aactactactgctt------tcccaaatcatgctgcac------------gtctatcaat
gaagtgat-------ctgtcatggaattcctgacacgaggccattggaa--aatggtgac
attgtcaacattgacat-----tactgcgtaccatcggagctttcatggggacttaaacg
agaccatgtttgtcggagaggtagacgaggccagcaaggagctggtgaagaccacacacg
aatgcctcatgatggccattgaagaagtgaaacccggtgtg-------------------
------------------------------------------------------------
------cggtacagagagatgggtaacatcatccagaag---catgcccaggcccatgga
tactctgttgttcgcagctactgtggccatggt--------------------------

seqkit can be downloaded from here

with awk:

$ awk  '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}} END {print "\r"}' test.fa| awk -v RS='>' -v OFS="\n" '$2 ~ /[atgc]{180}/ {print ">"$1,$2}'

>Mitra_TRINITY_DN40971_c28_g2_i1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------gtgctcttccgatctgccaaagaaacac--------------------------------------------------------aggaggagagcaagaatggttacaacccattccccggctttcgttacacgggatcattgagagcattccctgtgacaccaaagagag-------------agatgcctgacacaatccctcgtcctgattatgctgaccattcagaaggtatccccctgtcagagcgccagatgagaggttcgacaaacatcaaacagctggacgatgaagagcaggaaggcatgcgagttgtgtgcaagttggctcgggaaatcctggatgaagcagccagtgctgttggcattggggtaactacagatgaaattgacagactggtccatgagg-caacaatagacagagagtgctatccatctcct--------------------------------------ctgaactactactgctt------tcccaaatcatgctgcac------------gtctatcaatgaagtgat-------ctgtcatggaattcctgacacgaggccattggaa--aatggtgacattgtcaacattgacat-----tactgcgtaccatcggagctttcatggggacttaaacgagaccatgtttgtcggagaggtagacgaggccagcaaggagctggtgaagaccacacacgaatgcctcatgatggccattgaagaagtgaaacccggtgtg-------------------------------------------------------------------------------------cggtacagagagatgggtaacatcatccagaag---catgcccaggcccatggatactctgttgttcgcagctactgtggccatggt--------------------------

First part of the code linearizes the fasta (from web/biostars) and second part does the job.

ADD REPLY

Login before adding your answer.

Traffic: 3003 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6