How can I remove sequences with only gaps (-) from a multi-fasta file?
15 months ago
jon.brate • 220
Norway

I have many files containing single-gene alignments in FASTA format, and many of the sequences in them consist of gaps only. How should I go about removing these?

Thanks, Jon

4 months ago
RamRS 21k
Houston, TX

Try bioawk. You can use awk-like filtering to achieve your goal.
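
For the original question (dropping records whose sequence is nothing but `-`), the awk-style filter might look like the sketch below. It uses plain awk rather than bioawk so that it runs without extra installs; the helper name `drop_gap_only` and the file names in the usage note are illustrative assumptions, not part of any tool. It reads one FASTA record at a time, joins the wrapped sequence lines, and prints the record unless the joined sequence consists solely of gap characters.

```shell
# Sketch: print only FASTA records whose sequence is not made up entirely of '-'.
# drop_gap_only is a hypothetical helper name, not part of bioawk or any other tool.
drop_gap_only() {
    awk '
        BEGIN { RS = ">"; FS = "\n" }      # one record per ">", one field per line
        NR > 1 {                           # record 1 is the empty text before the first ">"
            seq = ""
            for (i = 2; i <= NF; i++) seq = seq $i   # join wrapped sequence lines
            if (seq ~ /^-+$/) next                   # gaps only: skip the record
            printf ">%s\n", $1                       # header line
            for (i = 2; i <= NF; i++)                # original sequence lines, blanks dropped
                if ($i != "") print $i
        }
    ' "$@"
}
```

Usage would be along the lines of `drop_gap_only alignment.fasta > alignment.nogaps.fasta`. Files with Windows line endings would need the carriage returns stripped first, since a trailing `\r` stops the gaps-only test from matching.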

How?

Not obvious after RTFM of bioawk

Can you post some example data and the expected output, @gbl1?

Minimal input:

>M02764:85:000000000-BLWLC:1:1101:13787:1000 1:N:0:1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>M02764:85:000000000-BLWLC:1:1101:20057:1000 1:N:0:1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>M02764:85:000000000-BLWLC:1:1101:22051:1000 1:N:0:1
NNNNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>M02764:85:000000000-BLWLC:1:1101:11481:1000 1:N:0:1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>M02764:85:000000000-BLWLC:1:1101:20267:1000 1:N:0:1
TNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

>M02764:85:000000000-BLWLC:1:2109:20914:16615 2:N:0:1

ACCCGCCTGTGTGCGGACAACGGTATCCCTCTTACCGAGAAGGCACCAGGCGACCACCTC
GGAAAGGCCAGCCACGTCTCCCCCTTGAGATGTGATAACATCAAGATGAAGAAGGCCTCC
AGACAGGCCACTAGCAGAGCTAAGTTTAGATATTCTGGACTTTCACACCAACAAGAAGAC
AGATGTCATGAAAAAGGTTGTTTACCTTCTCAAGAGCACCTCAAAAGAAGGATCTCCGGG
AACGACTCCATCATCCTGGGAACGCACATTAACCCGCTCCGTTGAG

>M02764:85:000000000-BLWLC:1:2109:24781:16615 2:N:0:1

CAATCCTATAACCCGGTCTCCCAACTAACACCATGAGACCGGTAAACTAGAAAGCCTATC
AAGGACAAACCTTTGCCTTGCGCATAATCCACTTGAGCTAGATGATGACGATCTTGACCT
CCTCAAGTTGGACCACCTTTCTTGATTCCGCACGCTCGATGAAGACTTGTAGATTGCTCC

>M02764:85:000000000-BLWLC:1:2114:20846:20863 2:N:0:1

TATAGTTTTATTGTGCAGAAGCGGCATGGAAGACAAGAAGCAAAATAATTATGAGGACGG
CGAAAGACTACATATCTTAGTTACCAGAAGCTAGGTGATCAAATCAATCTACATGATGCA
TCTCTGCCTATGTAGCTTTGCTTACCAGAAGCTGGGTGATCAAAGTAATCTACATGATGC
ATCTCTACCTATGTAGCTTTGCTTAGCATAAACTGATCTCATGACTTTCTTTTAGCCTAT
GAACCTATCATCAATTTAATATATAGCGCTACGTATGAACTATTTT

>M02764:85:000000000-BLWLC:1:2114:19960:20863 2:N:0:1

AGAGGAGCAGGCAACACCAGTGCACGCCACCCACCACAGGGCAGCGCCACCCACCAGGCC
TGAGGCCGCCGCCTCAGATCCGAGATAAACCACCCTCACCTGTTGCCCACCTACAACAGG
CCACAGGTGCGCAGCCAGCCAGCCCGGCTGCCGCACCGATGCAACTCCACCGCAGATTGC


Desired output: 

>M02764:85:000000000-BLWLC:1:2109:20914:16615 2:N:0:1

ACCCGCCTGTGTGCGGACAACGGTATCCCTCTTACCGAGAAGGCACCAGGCGACCACCTC
GGAAAGGCCAGCCACGTCTCCCCCTTGAGATGTGATAACATCAAGATGAAGAAGGCCTCC
AGACAGGCCACTAGCAGAGCTAAGTTTAGATATTCTGGACTTTCACACCAACAAGAAGAC
AGATGTCATGAAAAAGGTTGTTTACCTTCTCAAGAGCACCTCAAAAGAAGGATCTCCGGG
AACGACTCCATCATCCTGGGAACGCACATTAACCCGCTCCGTTGAG

>M02764:85:000000000-BLWLC:1:2109:24781:16615 2:N:0:1

CAATCCTATAACCCGGTCTCCCAACTAACACCATGAGACCGGTAAACTAGAAAGCCTATC
AAGGACAAACCTTTGCCTTGCGCATAATCCACTTGAGCTAGATGATGACGATCTTGACCT
CCTCAAGTTGGACCACCTTTCTTGATTCCGCACGCTCGATGAAGACTTGTAGATTGCTCC

>M02764:85:000000000-BLWLC:1:2114:20846:20863 2:N:0:1

TATAGTTTTATTGTGCAGAAGCGGCATGGAAGACAAGAAGCAAAATAATTATGAGGACGG
CGAAAGACTACATATCTTAGTTACCAGAAGCTAGGTGATCAAATCAATCTACATGATGCA
TCTCTGCCTATGTAGCTTTGCTTACCAGAAGCTGGGTGATCAAAGTAATCTACATGATGC
ATCTCTACCTATGTAGCTTTGCTTAGCATAAACTGATCTCATGACTTTCTTTTAGCCTAT
GAACCTATCATCAATTTAATATATAGCGCTACGTATGAACTATTTT

>M02764:85:000000000-BLWLC:1:2114:19960:20863 2:N:0:1

AGAGGAGCAGGCAACACCAGTGCACGCCACCCACCACAGGGCAGCGCCACCCACCAGGCC
TGAGGCCGCCGCCTCAGATCCGAGATAAACCACCCTCACCTGTTGCCCACCTACAACAGG
CCACAGGTGCGCAGCCAGCCAGCCCGGCTGCCGCACCGATGCAACTCCACCGCAGATTGC

I think there is a formatting problem in this post: I see only Ns in the first sequences, and I do not see any gaps. What is the requirement? Judging from your input and output, try the following with seqkit:

$ seqkit grep -w 0 -svrip "N"+ test1.fa 
>M02764:85:000000000-BLWLC:1:2109:20914:16615 2:N:0:1
ACCCGCCTGTGTGCGGACAACGGTATCCCTCTTACCGAGAAGGCACCAGGCGACCACCTCGGAAAGGCCAGCCACGTCTCCCCCTTGAGATGTGATAACATCAAGATGAAGAAGGCCTCCAGACAGGCCACTAGCAGAGCTAAGTTTAGATATTCTGGACTTTCACACCAACAAGAAGACAGATGTCATGAAAAAGGTTGTTTACCTTCTCAAGAGCACCTCAAAAGAAGGATCTCCGGGAACGACTCCATCATCCTGGGAACGCACATTAACCCGCTCCGTTGAG
>M02764:85:000000000-BLWLC:1:2109:24781:16615 2:N:0:1
CAATCCTATAACCCGGTCTCCCAACTAACACCATGAGACCGGTAAACTAGAAAGCCTATCAAGGACAAACCTTTGCCTTGCGCATAATCCACTTGAGCTAGATGATGACGATCTTGACCTCCTCAAGTTGGACCACCTTTCTTGATTCCGCACGCTCGATGAAGACTTGTAGATTGCTCC
>M02764:85:000000000-BLWLC:1:2114:20846:20863 2:N:0:1
TATAGTTTTATTGTGCAGAAGCGGCATGGAAGACAAGAAGCAAAATAATTATGAGGACGGCGAAAGACTACATATCTTAGTTACCAGAAGCTAGGTGATCAAATCAATCTACATGATGCATCTCTGCCTATGTAGCTTTGCTTACCAGAAGCTGGGTGATCAAAGTAATCTACATGATGCATCTCTACCTATGTAGCTTTGCTTAGCATAAACTGATCTCATGACTTTCTTTTAGCCTATGAACCTATCATCAATTTAATATATAGCGCTACGTATGAACTATTTT
>M02764:85:000000000-BLWLC:1:2114:19960:20863 2:N:0:1
AGAGGAGCAGGCAACACCAGTGCACGCCACCCACCACAGGGCAGCGCCACCCACCAGGCCTGAGGCCGCCGCCTCAGATCCGAGATAAACCACCCTCACCTGTTGCCCACCTACAACAGGCCACAGGTGCGCAGCCAGCCAGCCCGGCTGCCGCACCGATGCAACTCCACCGCAGATTGC

PS: note that the input is not entirely clear. This command works on the example data posted here, and its output matches the example output. It can be improved once you give some feedback. It looks for N's in the sequence and does an inverse grep, so any record containing an N is dropped.

Indeed, a formatting issue...

So I tried it and got: bash: seqkit: command not found

Above I linked seqkit to its GitHub page, where a GNU/Linux binary is available for download. If you have conda or brew installed, try conda install -c bioconda seqkit or brew install seqkit.
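
If neither conda nor brew is available, a downloaded binary can be run from a directory you own. The following is a minimal sketch, assuming the release archive has already been downloaded and unpacked so that an executable named seqkit sits in the current directory. The target directory `$HOME/bin` and the helper name `install_user_binary` are arbitrary choices for illustration.

```shell
# Sketch: install an already-downloaded executable for the current user only.
# install_user_binary is a hypothetical helper; $HOME/bin is an assumed location.
install_user_binary() {
    src=$1                           # path to the unpacked executable, e.g. ./seqkit
    bin_dir=${2:-"$HOME/bin"}        # user-owned directory, no admin rights needed
    mkdir -p "$bin_dir" || return 1
    cp "$src" "$bin_dir"/ || return 1
    chmod u+x "$bin_dir/$(basename "$src")" || return 1
    case ":$PATH:" in
        *":$bin_dir:"*) ;;                         # already on PATH
        *) PATH="$bin_dir:$PATH"; export PATH ;;   # current shell session only
    esac
}
```

After `install_user_binary ./seqkit`, running `seqkit version` should find the tool. To keep the setting across sessions, the same PATH line can be added to a shell startup file such as `~/.bashrc`.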

Actually, I've got another way… I need to ask the computer service… I do not have admin rights on my university computer.

You don't need admin rights to install tools using conda.

Just to add to this: admin rights are not needed to install conda or brew; both give you ways to bypass that requirement. Once the tools themselves are installed, they (at least brew does this) work to ensure you don't use root access.

seqkit doesn't need installation. Download the binary, keep it somewhere in your home directory, and add that directory to your PATH. In the meantime, try this awk script (modified from sayuj.koyyappurath's script in https://www.biostars.org/p/9262/):

 $ awk -F "\t" 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print}' test.fa | awk -v OFS="\n" -F "\t" '$2 !~ /N{10,}/ {print ">"$1,$2}'

output:

>M02764:85:000000000-BLWLC:1:2109:20914:16615 2:N:0:1
ACCCGCCTGTGTGCGGACAACGGTATCCCTCTTACCGAGAAGGCACCAGGCGACCACCTCGGAAAGGCCAGCCACGTCTCCCCCTTGAGATGTGATAACATCAAGATGAAGAAGGCCTCCAGACAGGCCACTAGCAGAGCTAAGTTTAGATATTCTGGACTTTCACACCAACAAGAAGACAGATGTCATGAAAAAGGTTGTTTACCTTCTCAAGAGCACCTCAAAAGAAGGATCTCCGGGAACGACTCCATCATCCTGGGAACGCACATTAACCCGCTCCGTTGAG
>M02764:85:000000000-BLWLC:1:2109:24781:16615 2:N:0:1
CAATCCTATAACCCGGTCTCCCAACTAACACCATGAGACCGGTAAACTAGAAAGCCTATCAAGGACAAACCTTTGCCTTGCGCATAATCCACTTGAGCTAGATGATGACGATCTTGACCTCCTCAAGTTGGACCACCTTTCTTGATTCCGCACGCTCGATGAAGACTTGTAGATTGCTCC
>M02764:85:000000000-BLWLC:1:2114:20846:20863 2:N:0:1
TATAGTTTTATTGTGCAGAAGCGGCATGGAAGACAAGAAGCAAAATAATTATGAGGACGGCGAAAGACTACATATCTTAGTTACCAGAAGCTAGGTGATCAAATCAATCTACATGATGCATCTCTGCCTATGTAGCTTTGCTTACCAGAAGCTGGGTGATCAAAGTAATCTACATGATGCATCTCTACCTATGTAGCTTTGCTTAGCATAAACTGATCTCATGACTTTCTTTTAGCCTATGAACCTATCATCAATTTAATATATAGCGCTACGTATGAACTATTTT
>M02764:85:000000000-BLWLC:1:2114:19960:20863 2:N:0:1
AGAGGAGCAGGCAACACCAGTGCACGCCACCCACCACAGGGCAGCGCCACCCACCAGGCCTGAGGCCGCCGCCTCAGATCCGAGATAAACCACCCTCACCTGTTGCCCACCTACAACAGGCCACAGGTGCGCAGCCAGCCAGCCCGGCTGCCGCACCGATGCAACTCCACCGCAGATTGC
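
The pipeline above works in two stages: it first joins each record into a single header-plus-sequence line, then drops records whose sequence contains a long run of N. The same idea can be sketched as one awk call with the run length exposed as a parameter. The helper name `drop_n_runs` and the default of 10 are illustrative assumptions. The run of N's is built with a loop and searched for with `index()`, so the sketch does not rely on regex interval support, which some older awk implementations lack.

```shell
# Sketch: drop FASTA records whose sequence contains >= min_run consecutive N's.
# drop_n_runs is a hypothetical helper; kept records are written with unwrapped sequences.
drop_n_runs() {
    min_run=${1:-10}                           # assumed default threshold, must be >= 1
    if [ "$#" -gt 0 ]; then shift; fi          # remaining arguments are input files (or stdin)
    awk -v n="$min_run" '
        BEGIN {
            RS = ">"; FS = "\n"
            for (i = 0; i < n; i++) run = run "N"    # literal string of n N characters
        }
        NR > 1 {
            seq = ""
            for (i = 2; i <= NF; i++) seq = seq $i   # linearize the sequence
            if (index(toupper(seq), run) == 0)       # no run of n N characters found
                printf ">%s\n%s\n", $1, seq
        }
    ' "$@"
}
```

For the example data, `drop_n_runs 10 test.fa > filtered.fa` would be the equivalent call (the output file name is made up). Raising the threshold keeps reads that contain shorter stretches of N.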