How can I remove sequences with only gaps (-) from a multi-fasta file?
9.0 years ago
jon.brate ▴ 290

I have many files with single-gene alignments in FASTA format, and many of the sequences in them consist of gaps only. What is a good way to remove these?

Thanks, Jon

fasta trimming unix • 5.3k views
9.0 years ago
Ram 43k

Try bioawk. You can use awk-like filtering to achieve your goal.


How?

It's not obvious, even after reading the bioawk manual.
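
For the question as originally asked (sequences made only of '-'), the awk-like filtering idea can be sketched with plain awk, so bioawk is not strictly needed. This is only a sketch under assumptions: the function name drop_gap_only is made up for illustration, gaps are assumed to be written as '-', and each surviving sequence is written back on one line rather than with its original wrapping. With bioawk the same test should be expressible on $seq in its -c fastx mode, but that variant is not tested here.

```shell
# Hypothetical helper (not from the thread): print only the FASTA records
# whose sequence contains at least one character other than "-".
# Records with an empty sequence are dropped as well.
drop_gap_only() {
  awk '
    function emit() {
      # Print the buffered record if its sequence has a non-gap character.
      if (header != "" && seq ~ /[^-]/) printf "%s\n%s\n", header, seq
    }
    /^>/ { emit(); header = $0; seq = ""; next }  # new record starts
    { gsub(/[ \t\r]/, ""); seq = seq $0 }         # join wrapped lines
    END { emit() }                                # flush the last record
  ' "$@"
}
```

Usage would look like drop_gap_only alignment.fa > alignment.nogaps.fa, where both file names are placeholders. With no file argument the function reads standard input.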


Can you post some example data and the expected output? @gbl1


Minimal input:

>M02764:85:000000000-BLWLC:1:1101:13787:1000 1:N:0:1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>M02764:85:000000000-BLWLC:1:1101:20057:1000 1:N:0:1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>M02764:85:000000000-BLWLC:1:1101:22051:1000 1:N:0:1
NNNNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>M02764:85:000000000-BLWLC:1:1101:11481:1000 1:N:0:1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>M02764:85:000000000-BLWLC:1:1101:20267:1000 1:N:0:1
TNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

>M02764:85:000000000-BLWLC:1:2109:20914:16615 2:N:0:1

ACCCGCCTGTGTGCGGACAACGGTATCCCTCTTACCGAGAAGGCACCAGGCGACCACCTC
GGAAAGGCCAGCCACGTCTCCCCCTTGAGATGTGATAACATCAAGATGAAGAAGGCCTCC
AGACAGGCCACTAGCAGAGCTAAGTTTAGATATTCTGGACTTTCACACCAACAAGAAGAC
AGATGTCATGAAAAAGGTTGTTTACCTTCTCAAGAGCACCTCAAAAGAAGGATCTCCGGG
AACGACTCCATCATCCTGGGAACGCACATTAACCCGCTCCGTTGAG

>M02764:85:000000000-BLWLC:1:2109:24781:16615 2:N:0:1

CAATCCTATAACCCGGTCTCCCAACTAACACCATGAGACCGGTAAACTAGAAAGCCTATC
AAGGACAAACCTTTGCCTTGCGCATAATCCACTTGAGCTAGATGATGACGATCTTGACCT
CCTCAAGTTGGACCACCTTTCTTGATTCCGCACGCTCGATGAAGACTTGTAGATTGCTCC

>M02764:85:000000000-BLWLC:1:2114:20846:20863 2:N:0:1

TATAGTTTTATTGTGCAGAAGCGGCATGGAAGACAAGAAGCAAAATAATTATGAGGACGG
CGAAAGACTACATATCTTAGTTACCAGAAGCTAGGTGATCAAATCAATCTACATGATGCA
TCTCTGCCTATGTAGCTTTGCTTACCAGAAGCTGGGTGATCAAAGTAATCTACATGATGC
ATCTCTACCTATGTAGCTTTGCTTAGCATAAACTGATCTCATGACTTTCTTTTAGCCTAT
GAACCTATCATCAATTTAATATATAGCGCTACGTATGAACTATTTT

>M02764:85:000000000-BLWLC:1:2114:19960:20863 2:N:0:1

AGAGGAGCAGGCAACACCAGTGCACGCCACCCACCACAGGGCAGCGCCACCCACCAGGCC
TGAGGCCGCCGCCTCAGATCCGAGATAAACCACCCTCACCTGTTGCCCACCTACAACAGG
CCACAGGTGCGCAGCCAGCCAGCCCGGCTGCCGCACCGATGCAACTCCACCGCAGATTGC


Desired output: 

>M02764:85:000000000-BLWLC:1:2109:20914:16615 2:N:0:1

ACCCGCCTGTGTGCGGACAACGGTATCCCTCTTACCGAGAAGGCACCAGGCGACCACCTC
GGAAAGGCCAGCCACGTCTCCCCCTTGAGATGTGATAACATCAAGATGAAGAAGGCCTCC
AGACAGGCCACTAGCAGAGCTAAGTTTAGATATTCTGGACTTTCACACCAACAAGAAGAC
AGATGTCATGAAAAAGGTTGTTTACCTTCTCAAGAGCACCTCAAAAGAAGGATCTCCGGG
AACGACTCCATCATCCTGGGAACGCACATTAACCCGCTCCGTTGAG

>M02764:85:000000000-BLWLC:1:2109:24781:16615 2:N:0:1

CAATCCTATAACCCGGTCTCCCAACTAACACCATGAGACCGGTAAACTAGAAAGCCTATC
AAGGACAAACCTTTGCCTTGCGCATAATCCACTTGAGCTAGATGATGACGATCTTGACCT
CCTCAAGTTGGACCACCTTTCTTGATTCCGCACGCTCGATGAAGACTTGTAGATTGCTCC

>M02764:85:000000000-BLWLC:1:2114:20846:20863 2:N:0:1

TATAGTTTTATTGTGCAGAAGCGGCATGGAAGACAAGAAGCAAAATAATTATGAGGACGG
CGAAAGACTACATATCTTAGTTACCAGAAGCTAGGTGATCAAATCAATCTACATGATGCA
TCTCTGCCTATGTAGCTTTGCTTACCAGAAGCTGGGTGATCAAAGTAATCTACATGATGC
ATCTCTACCTATGTAGCTTTGCTTAGCATAAACTGATCTCATGACTTTCTTTTAGCCTAT
GAACCTATCATCAATTTAATATATAGCGCTACGTATGAACTATTTT

>M02764:85:000000000-BLWLC:1:2114:19960:20863 2:N:0:1

AGAGGAGCAGGCAACACCAGTGCACGCCACCCACCACAGGGCAGCGCCACCCACCAGGCC
TGAGGCCGCCGCCTCAGATCCGAGATAAACCACCCTCACCTGTTGCCCACCTACAACAGG
CCACAGGTGCGCAGCCAGCCAGCCCGGCTGCCGCACCGATGCAACTCCACCGCAGATTGC

I think there is a formatting problem in this post: I see only Ns in the first sequences and no gaps. What exactly is the requirement? Going by your input and output, try the following with seqkit:

$ seqkit grep -w 0 -svrip "N"+ test1.fa 
>M02764:85:000000000-BLWLC:1:2109:20914:16615 2:N:0:1
ACCCGCCTGTGTGCGGACAACGGTATCCCTCTTACCGAGAAGGCACCAGGCGACCACCTCGGAAAGGCCAGCCACGTCTCCCCCTTGAGATGTGATAACATCAAGATGAAGAAGGCCTCCAGACAGGCCACTAGCAGAGCTAAGTTTAGATATTCTGGACTTTCACACCAACAAGAAGACAGATGTCATGAAAAAGGTTGTTTACCTTCTCAAGAGCACCTCAAAAGAAGGATCTCCGGGAACGACTCCATCATCCTGGGAACGCACATTAACCCGCTCCGTTGAG
>M02764:85:000000000-BLWLC:1:2109:24781:16615 2:N:0:1
CAATCCTATAACCCGGTCTCCCAACTAACACCATGAGACCGGTAAACTAGAAAGCCTATCAAGGACAAACCTTTGCCTTGCGCATAATCCACTTGAGCTAGATGATGACGATCTTGACCTCCTCAAGTTGGACCACCTTTCTTGATTCCGCACGCTCGATGAAGACTTGTAGATTGCTCC
>M02764:85:000000000-BLWLC:1:2114:20846:20863 2:N:0:1
TATAGTTTTATTGTGCAGAAGCGGCATGGAAGACAAGAAGCAAAATAATTATGAGGACGGCGAAAGACTACATATCTTAGTTACCAGAAGCTAGGTGATCAAATCAATCTACATGATGCATCTCTGCCTATGTAGCTTTGCTTACCAGAAGCTGGGTGATCAAAGTAATCTACATGATGCATCTCTACCTATGTAGCTTTGCTTAGCATAAACTGATCTCATGACTTTCTTTTAGCCTATGAACCTATCATCAATTTAATATATAGCGCTACGTATGAACTATTTT
>M02764:85:000000000-BLWLC:1:2114:19960:20863 2:N:0:1
AGAGGAGCAGGCAACACCAGTGCACGCCACCCACCACAGGGCAGCGCCACCCACCAGGCCTGAGGCCGCCGCCTCAGATCCGAGATAAACCACCCTCACCTGTTGCCCACCTACAACAGGCCACAGGTGCGCAGCCAGCCAGCCCGGCTGCCGCACCGATGCAACTCCACCGCAGATTGC

PS: Note that the input is not clear. This command works on the example data posted here, and its output matches the example output. It can be improved once you give some feedback. It looks for Ns in the sequence and does an inverse grep, so any sequence containing an N is dropped.
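
Because an unanchored N pattern drops every read that contains even a single N, a gentler variant is to drop a read only when most of it is N or gap. The following is a rough awk sketch of that idea, not the seqkit command above: the function name drop_mostly_n and the threshold argument are hypothetical, and a suitable cutoff depends on your data.

```shell
# Hypothetical helper: drop FASTA records in which the fraction of N, n or "-"
# characters is greater than or equal to max_frac (a number between 0 and 1).
# usage: drop_mostly_n MAX_FRAC [file ...]
drop_mostly_n() {
  _max_frac=$1
  shift
  awk -v max="$_max_frac" '
    function emit(   tmp, bad) {
      if (header == "" || length(seq) == 0) return
      tmp = seq
      bad = gsub(/[Nn-]/, "", tmp)   # gsub returns the number of replacements
      if (bad / length(seq) < max) printf "%s\n%s\n", header, seq
    }
    /^>/ { emit(); header = $0; seq = ""; next }  # new record starts
    { gsub(/[ \t\r]/, ""); seq = seq $0 }         # join wrapped lines
    END { emit() }                                # flush the last record
  ' "$@"
}
```

For instance, drop_mostly_n 0.5 test1.fa would keep only reads that are less than half N or gap. As in the seqkit call with -w 0, sequences come out on a single line.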


Indeed, a formatting issue...

So I tried it and got: bash: seqkit: command not found


Above I linked seqkit to its GitHub page, where a GNU/Linux binary is available for download. If you have conda or brew installed, try conda install -c bioconda seqkit or brew install seqkit.


Actually, I've got another way… I need to ask the computer service… I do not have admin rights on my university computer.


You don't need admin rights to install tools using conda.


Just to add to this: admin rights are not needed to install conda or brew. Both give you ways to bypass that requirement. Once the tools themselves are installed, they (at least brew does this) work to ensure you don't use root access.


seqkit doesn't need installation. Download the binary, keep it somewhere in your home directory, and add that directory to your PATH. In the meantime, try this awk script (modified from sayuj.koyyappurath's script in https://www.biostars.org/p/9262/):

 $ awk 'BEGIN{RS=">"} NR>1 {sub("\n","\t"); gsub("\n",""); print}' test.fa | awk -F "\t" -v OFS="\n" '$2 !~ /NNNNNNNNNN/ {print ">"$1, $2}'

output:

>M02764:85:000000000-BLWLC:1:2109:20914:16615 2:N:0:1
ACCCGCCTGTGTGCGGACAACGGTATCCCTCTTACCGAGAAGGCACCAGGCGACCACCTCGGAAAGGCCAGCCACGTCTCCCCCTTGAGATGTGATAACATCAAGATGAAGAAGGCCTCCAGACAGGCCACTAGCAGAGCTAAGTTTAGATATTCTGGACTTTCACACCAACAAGAAGACAGATGTCATGAAAAAGGTTGTTTACCTTCTCAAGAGCACCTCAAAAGAAGGATCTCCGGGAACGACTCCATCATCCTGGGAACGCACATTAACCCGCTCCGTTGAG
>M02764:85:000000000-BLWLC:1:2109:24781:16615 2:N:0:1
CAATCCTATAACCCGGTCTCCCAACTAACACCATGAGACCGGTAAACTAGAAAGCCTATCAAGGACAAACCTTTGCCTTGCGCATAATCCACTTGAGCTAGATGATGACGATCTTGACCTCCTCAAGTTGGACCACCTTTCTTGATTCCGCACGCTCGATGAAGACTTGTAGATTGCTCC
>M02764:85:000000000-BLWLC:1:2114:20846:20863 2:N:0:1
TATAGTTTTATTGTGCAGAAGCGGCATGGAAGACAAGAAGCAAAATAATTATGAGGACGGCGAAAGACTACATATCTTAGTTACCAGAAGCTAGGTGATCAAATCAATCTACATGATGCATCTCTGCCTATGTAGCTTTGCTTACCAGAAGCTGGGTGATCAAAGTAATCTACATGATGCATCTCTACCTATGTAGCTTTGCTTAGCATAAACTGATCTCATGACTTTCTTTTAGCCTATGAACCTATCATCAATTTAATATATAGCGCTACGTATGAACTATTTT
>M02764:85:000000000-BLWLC:1:2114:19960:20863 2:N:0:1
AGAGGAGCAGGCAACACCAGTGCACGCCACCCACCACAGGGCAGCGCCACCCACCAGGCCTGAGGCCGCCGCCTCAGATCCGAGATAAACCACCCTCACCTGTTGCCCACCTACAACAGGCCACAGGTGCGCAGCCAGCCAGCCCGGCTGCCGCACCGATGCAACTCCACCGCAGATTGC
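
The "download the binary and add it to your path" step mentioned above can be done without admin rights. A minimal sketch, assuming the executable has already been downloaded and unpacked: the function name install_user_binary and the default target directory $HOME/bin are illustrative choices, not anything seqkit requires.

```shell
# Hypothetical helper: copy an already-downloaded executable into a
# user-owned directory and put that directory on PATH for this session.
# usage: install_user_binary PATH_TO_BINARY [DEST_DIR]
install_user_binary() {
  src=$1
  dest_dir=${2:-"$HOME/bin"}
  mkdir -p "$dest_dir" || return 1
  cp "$src" "$dest_dir/" || return 1
  chmod u+x "$dest_dir/$(basename "$src")" || return 1
  case ":$PATH:" in
    *":$dest_dir:"*) ;;                        # already on PATH, nothing to do
    *) PATH="$dest_dir:$PATH"; export PATH ;;  # prepend for the current shell
  esac
}
```

After something like install_user_binary ./seqkit, the command should be found by the current shell. To keep it across logins, add a line such as export PATH="$HOME/bin:$PATH" to your shell startup file (for bash, typically ~/.bashrc).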