How to remove Bad Nucleotides represented by "N" from FASTA file by using UNIX?
6.4 years ago

How can I remove bad nucleotides, represented by "N", from a FASTA file using UNIX tools? Thanks in advance.

sequence • 15k views

If your aim is to remove sequences with a particular percentage or number of Ns, you can try Prinseq-lite with the -ns_max_p and -ns_max_n options, respectively.

It can also remove leading and trailing Ns using the -trim_ns_left and -trim_ns_right options.
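
For readers who want to see what such a filter does, here is a minimal Python sketch of dropping sequences by N percentage or N count. It is not Prinseq's implementation: the function names and the parameters max_n_percent and max_n_count are made up for this illustration, and the records are assumed to be (header, sequence) pairs already read into memory.

```python
def n_fraction(seq):
    """Return the fraction of bases in seq that are N or n (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "Nn" for base in seq) / len(seq)


def filter_by_n(records, max_n_percent=None, max_n_count=None):
    """Yield (header, seq) pairs that do not exceed the given N thresholds.

    Both thresholds are optional; their names are hypothetical and only
    loosely mirror the Prinseq-lite options mentioned above.
    """
    for header, seq in records:
        n_count = sum(base in "Nn" for base in seq)
        if max_n_count is not None and n_count > max_n_count:
            continue
        if max_n_percent is not None and 100.0 * n_fraction(seq) > max_n_percent:
            continue
        yield header, seq


# Made-up example records.
records = [(">clean", "ACGTACGT"), (">some_n", "ACGNACGT"), (">mostly_n", "NNNNNNAC")]
for header, seq in filter_by_n(records, max_n_percent=20):
    print(header)
    print(seq)
```

With a 20% threshold, the made-up record ">mostly_n" is dropped and the other two are kept.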

Do you need to trim leading/trailing N's?

Yes

Hi, I have been trying to use the suggested command to remove the "N"s, but I cannot run FastQC on the output file. Could you help me fix this? This is the command I used:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < elimination_n | sed 's/N//g' | tr "\t" "\n" > elimination1

Thanks!

The solution below only works for FASTA format files. What format is your file in? If you have FASTQ data, this solution will not work. If you like, show us the output of head -4 elimination_n.

My files are in FASTQ format :( Thank you very much for your help. Is there a command that removes the Ns from FASTQ files?
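
For FASTQ, every removed base also needs its quality character removed, otherwise the sequence and quality lines end up with different lengths and downstream tools are likely to reject the file. A minimal Python sketch, assuming strict four-line records (header, sequence, "+", quality); the function name is hypothetical, and reads that consist only of Ns are dropped rather than written out empty:

```python
def strip_n_from_fastq(lines):
    """Yield FASTQ lines with N bases and their quality characters removed.

    Assumes four lines per record; an incomplete trailing record is ignored.
    """
    it = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups the lines into records.
    for header, seq, plus, qual in zip(it, it, it, it):
        kept = [(base, q) for base, q in zip(seq, qual) if base not in "Nn"]
        if not kept:
            continue  # the read was all N, so nothing is left to write
        yield header
        yield "".join(base for base, _ in kept)
        yield plus
        yield "".join(q for _, q in kept)


# Made-up example reads.
reads = ["@read1", "ANNGT", "+", "I##HI", "@read2", "NNNN", "+", "####"]
for line in strip_n_from_fastq(reads):
    print(line)
```

Whether deleting Ns from inside reads is a good idea depends on the downstream analysis; trimming or filtering the affected reads is often the safer route.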

I removed the first 20 bases from my reads, which decreased the "N" content. Is it still necessary to remove those miscalled bases from my reads?

6.4 years ago
GenoMax 141k

If you don't care about changing the coordinates then sed 's/N//g' your_fasta > new.fa should do it. Note: This sledgehammer solution will remove any N's that may be in the fasta headers too.

Edit (2019): The following solution avoids the blank lines and gaps that the solution above leaves behind. The FASTA-linearizing code (first part) is courtesy of @Pierre. Like the one-liner above, it still removes any N's from the headers.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < your_file | sed 's/N//g' | tr "\t" "\n" > new_file

If you need a fixed number of nucleotides per line (instead of a single line), fold your file like this: fold -w NN your_file > new_file, where NN is the desired line width.
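
The same three steps (linearize, remove Ns, re-wrap) can be sketched in Python so that the headers are left untouched. This is only an illustration, not the pipeline above: the function names and the width parameter are hypothetical, and unlike the sed call it removes both N and n.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining multi-line sequences into one string."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_n_and_wrap(lines, width=60):
    """Yield output lines: untouched headers, then N-free sequence wrapped at width."""
    for header, seq in read_fasta(lines):
        seq = seq.replace("N", "").replace("n", "")
        yield header
        for start in range(0, len(seq), width):
            yield seq[start:start + width]


# Made-up example input with an N in the header and N-only lines in the sequence.
fasta = [">seq1 with N in header", "ACGTNN", "NNNN", "NNACGT"]
for line in remove_n_and_wrap(fasta, width=4):
    print(line)
```

Because the sequence is joined before the Ns are removed, no empty or short lines are left in the middle of a record.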

Thank you for your reply. However, I have already used sed 's/N//g' your_fasta > new.fa and nothing changed; I mean the "N"s are not removed.

What do you mean by "with no change"? This is about the simplest replacement task for sed. Just in case: you have to use your own file names in place of the placeholders above.

This would remove Ns in fasta headers as well :)

Noted as such already in the original answer :)

Can you show us the output of

sed 's/N//g' your_fasta | grep 'N'

Or if you get a lot of output, show us a part.

Hi Genomax, I used sed 's/N//g' your_fasta > new.fa to remove all 'N's from a FASTA sequence. It worked, but now there is white space where the N's used to be. Can you tell me how to get rid of these gaps as well? This is how it looks now.

Thank you.

>DAT1-COMP102480-C1-SEQ1-1788-1
ATGGCGCCACATGAACTCCGGCGTACTTTTAAGCGCACGGCAATCTCGGA
TCAACAACGGCGAAGAGATATCGCGCTTCTACGGCAGAACCAGCAGCGTT
CCGACTCACAGAATCGTGCCCGCCGCCTCGCCTCTTCTGTCCTCGCCATT
CCCGACCACTCATCTCCGGCCGAAGCCCAAGTCGACCTCCCCGACGTCGT
AGATGTCCATACCGATTTGTATCTGGATCATTCTTCGGAGCCGGAGGCCG
CTTCTCCTGCAGGCAGACAGTTGGATGTGGTCGAAGCCTCAGATTTGAAG
GGCTGGACGGCCCGCCACTGGTTCTCCCGCCAGCTTATGCTCATGGAATG
GATGATTGACGTGCCTCCTAGCCTCGATCGCGATTG



GTACGTCTTTGCAAGACC
TTCTGGTAAGCGCTGCTTTGTTGTTTCTTCAAATGGTACCACAGTGAGCA
GGCTTCGTAATGGCTCTGTTTTGCATCGTTTTCCATCTTCCTTACCTAAT
GGCGCTAGGACAAAAGAAATATCAGCTCCATCACATGTTTTTTGTATACT
TGATTGCATTTTTCATGAG

CCTGATCAGACATTTTATGTGATTGATATGATTTGT
TGGCGAGGATACTCATTATATGATTGTAGTGCGGAGTTCAGATTTTTTTG
GTTGAACTCAAAGCTTTTGGAGACTGGAGCCTGTGATCTTCCTTCAGTAT
ACCATAGGTATAGATTCAGTGTTGTACCTGCTTATGAATGTAACCAGATA
GGCTTGCAAAAAGCATATACGTGTGGAGTGACGTTTGTTAAAGATGGCCT
ATTGTTCTACAACAG

GCATGCAAATTAT
CAGGCTGGGAATACTCCATTAGCACTAGTATGGAAGGATGAATTTTGTAG
CAAATATGTTTTGGATACAGACGGTGAAGGACAGGTTCCAATACAACAAC
A



GGTTGTCTTGGAGTTG
CAAGGTACTGGGAAGTTGATTACACATGATGATCCTCCAATTGTATTTGG
CTGCTTGGAGAGAGATTTCCTTCAAAAG





TCGG
GTTTGCAAGTTGGAAATCTTCTTCGGTTTTCCATCGTGAATGAAAGCGCG
AGGATAGTTGATGGCAAGCTGGAGTTGGGAGAGATTAAATTTCTCGGCAA
AGCAAACCGTTTTCGAGCTTTTGCAGATAGCTACTCAAAG

GTATTCTTCCAGCACAC
GGCCCGTCACTCTCCTCTTCAATTCATGGATCTGATGGTATCCGTGGATC
CGAGT

I have posted an updated solution in my original answer above. Can you try that out?

WOW!! it worked! Thanks!

6.4 years ago
Korsocius ▴ 250

sed -e '/^[^>]/s/[^ATGCatgc]/N/g' file.fa

In combination with genomax' answer this gives:

sed -e '/^[^>]/s/N//g'

which leaves the headers alone and does the correct replacement. Wrecking the headers might destroy important sequence identifiers.

6.4 years ago
kloetzl ★ 1.1k

Using pfasta:

% pfasta acgt bad.fasta > good.fasta

Thank you, I will try it.

6.4 years ago
Farbod ★ 3.4k

Dear @The Bright Star, Hi and welcome to Biostars

Have you checked the FASTA cleanup by Pierre Lindenbaum in Biostars?

And also "How to remove N from fasta sequences"?

Hi Farbod, yes, and neither of them works. Cheers

6.4 years ago
Joe 21k

Same caveat as genomax, this will wreck your headers if they contain the 'N' string, but this is about as simple as it gets:

tr -d 'N' < seqs.fasta

If you need to preserve the headers, then we'll have to get a bit more inventive with Biopython or some other proper parser.

Test data before:

>tpg|Magnaporthiopsis_incrustans|JF414846
ACTGTAGTAGCTACGATCGATCAGATGATCACGTAGCATCGATCGATCATCGACTAGTAGATCACTCGACATAGATCCACATCAATAGATCATCATCATCATAATCGATCACTAGCAGCNNNNNN
>tpg|Pyricularia_pennisetigena|AB818016
NNNNNNGCAAGNTTCATGACGATGTAGAATGGCTTATCGAAGGGAGCAGGCCAGGGATTGAGGTCCGTCTCACGGGTTGGCTTCACTCCCCCACTGCCAGCCCTCTTGCTGCAACTCCACCAGAA
>tpg|Inocybe_sororia|EU525947
NNNAACCANGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGCNNN

Test data after:

>tpg|Magnaporthiopsis_incrustans|JF414846
ACTGTAGTAGCTACGATCGATCAGATGATCACGTAGCATCGATCGATCATCGACTAGTAGATCACTCGACATAGATCCACATCAATAGATCATCATCATCATAATCGATCACTAGCAGC
>tpg|Pyricularia_pennisetigena|AB818016
GCAAGTTCATGACGATGTAGAATGGCTTATCGAAGGGAGCAGGCCAGGGATTGAGGTCCGTCTCACGGGTTGGCTTCACTCCCCCACTGCCAGCCCTCTTGCTGCAACTCCACCAGAA
>tpg|Inocybe_sororia|EU525947
AACCAGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGC
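
A header-preserving variant does not strictly need Biopython. The following is a minimal standard-library sketch, not the answer's own code: it works line by line, deletes N and n from sequence lines with a translation table, and skips lines that become empty. The function name and the example records are made up.

```python
# Translation table that deletes both N and n.
DROP_N = str.maketrans("", "", "Nn")


def strip_n_keep_headers(lines):
    """Yield FASTA lines with Ns removed from sequence lines only."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line  # headers pass through unchanged
        else:
            cleaned = line.translate(DROP_N)
            if cleaned:  # skip lines that consisted only of Ns
                yield cleaned


# Made-up example: the header contains an N that must survive.
example = [">sample_N1\n", "NNNACGNT\n", "NNNN\n", "ACNN\n"]
for line in strip_n_keep_headers(example):
    print(line)
```

Since each line is handled on its own, multi-line records keep their line breaks, but the line lengths become uneven after the Ns are gone.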
6.4 years ago

Using seqkit (case-insensitive, affects only the sequence):

$ seqkit -is replace -p "n" -r "" test.fa

with sed:

$ sed -e '/^>/! s/[Nn]//g' test.fa

output:

>NnT
AT

input:

$ cat test.fa 
>NnT
nNANnTnN

To remove leading and trailing Ns:

$ sed -e '/^>/! { s/^[Nn]\+//; s/[Nn]\+$//; }' test1.fa

output:

>NnT
ANnT

input:

$ cat test1.fa
>NnT
nnnnnnnnNnnnnnnnNANnTnNnnnnnnnnnnnnnnnNnnnn
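
The same end trimming can be sketched in Python with str.strip. This is an illustration rather than a translation of the sed command: the function name is hypothetical, and it assumes each sequence sits on a single line, because with multi-line records every line would be trimmed separately.

```python
def trim_terminal_n(lines):
    """Yield FASTA lines with leading and trailing N/n removed from sequence lines."""
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            yield line  # headers are left as they are
        else:
            yield line.strip("Nn")  # internal Ns are kept


example = [">NnT", "nnnnnnnnNnnnnnnnNANnTnNnnnnnnnnnnnnnnnNnnnn"]
for line in trim_terminal_n(example):
    print(line)
```

For the test1.fa content shown above this prints the header unchanged, followed by ANnT.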
6.4 years ago
st.ph.n ★ 2.7k

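Linearize your FASTA first if it is multi-line.

Trim the leading and trailing Ns by splitting the sequence on N and keeping the longest piece (any internal N also ends a piece, so this returns the longest N-free stretch):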

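#!/usr/bin/env python3
import sys

# Expects linearized FASTA: every header line is followed by one sequence line.
with open(sys.argv[1]) as f:
    for line in f:
        if line.startswith(">"):
            header = line.strip()
            seq = next(f, "").strip()
            # Keep the longest stretch that contains no 'N'.
            trim = max(seq.split("N"), key=len)
            print(header)
            print(trim)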

Save as trim_N.py, run as python trim_N.py input.fasta > output.fasta.

Example input/output:

>1
NNNNNNNNNGGGAGGTGTTTTGGTCCTTGATCCTATTGCCTACGGCAGCCGCTGGATTGTTATTACTCGCGGCCCAGCCGGCCATGGCCCAGGTTCAGCTGCAGCAGTCTGGGGCTGAGCTGGTGAAGCNNNNNNNNNNN

>1
GGGAGGTGTTTTGGTCCTTGATCCTATTGCCTACGGCAGCCGCTGGATTGTTATTACTCGCGGCCCAGCCGGCCATGGCCCAGGTTCAGCTGCAGCAGTCTGGGGCTGAGCTGGTGAAGC
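
Keeping the longest piece after splitting on N gives the same result as end trimming only when the sequence has no internal Ns. A small sketch of the difference, using hypothetical function names and a made-up sequence:

```python
def longest_n_free_run(seq):
    """Return the longest stretch of seq that contains no 'N'."""
    return max(seq.split("N"), key=len)


def strip_terminal_n(seq):
    """Remove only the leading and trailing 'N' characters."""
    return seq.strip("N")


seq = "NNACGTNACGTTTNN"  # made-up sequence with one internal N
print(longest_n_free_run(seq))  # ACGTTT
print(strip_terminal_n(seq))    # ACGTNACGTTT
```

Which behaviour is wanted depends on whether the bases on the far side of an internal N should be kept.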

Alternatively, if you also want to remove N's that are in the middle of the sequence:

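#!/usr/bin/env python3
import sys

# Expects linearized FASTA: every header line is followed by one sequence line.
with open(sys.argv[1]) as f:
    for line in f:
        if line.startswith(">"):
            header = line.strip()
            seq = next(f, "").strip()
            # Drop every 'N', wherever it occurs in the sequence.
            trim = seq.replace("N", "")
            print(header)
            print(trim)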

Save and run same as above.

Example input/output:

>1
NNNNNNNNNGGGAGGTGTTTTGGTCCTTGATCCTATTGCCTACGGCNNNNNAGCCGCTGGATTGTTATTACTCNNNNNGCGGCCCAGCCGGCCATGGCCCAGGTTCAGCTGCAGCAGTCTGGGGCTGAGCTGGTGAAGCNNNNNNNNNNN

>1
GGGAGGTGTTTTGGTCCTTGATCCTATTGCCTACGGCAGCCGCTGGATTGTTATTACTCGCGGCCCAGCCGGCCATGGCCCAGGTTCAGCTGCAGCAGTCTGGGGCTGAGCTGGTGAAGC
6.4 years ago
steve.bond • 0

Not a pure UNIX solution, but the SeqBuddy --replace_subseq command can clean up any arbitrary sequence pattern from any standard sequence or alignment format.

$: seqbuddy.py <input file> --replace_subseq 'N'

Sample input:

>Bca-PanxA Random meta data with NNNNs in it
NNGGACATTTTAAGNGTCGTCACTCGTTTCCCTATACTAGNGTTNGGNGTAGAACGTCAC
GANGANGACTTNGCAGACAGAATAAACTACAAGTATACGG
>Pae-PanxB
ANGTTNGACGTCTTNGGATCNGTAAAGGGCCTACTAAAACTNGACAGCGNGNGCATAGAT
AATAACGTATTCCGGCTTCATTATAAAGCTACNGTAATAA

Result:

>Bca-PanxA Random meta data with NNNNs in it
GGACATTTTAAGGTCGTCACTCGTTTCCCTATACTAGGTTGGGTAGAACGTCACGAGAGA
CTTGCAGACAGAATAAACTACAAGTATACGG
>Pae-PanxB
AGTTGACGTCTTGGATCGTAAAGGGCCTACTAAAACTGACAGCGGGCATAGATAATAACG
TATTCCGGCTTCATTATAAAGCTACGTAATAA