How to extract all non-seqenced positions from a genome (Fasta file)?
1
1
Entering edit mode
9.1 years ago

Hello,

Sorry in advance for my English.

I want extract all positions of non-sequenced nucleotides for each chromosomes from a genome sequence file (Fasta).

To do that, I have a fasta file with the genome sequences, like this:

>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNtaaattgttt
taaattgtttctgtttgcagttgacatgatcttatatatagaaaacacca
ataactctgccaaaaaatttagaattcataaatgaatttagtaaagttgc

I want find all N positions and obtain this position in bed format, e.g.:

chr1 1 200 ...

I didn't found how to do that in Bedtools.

If you have any idea, could you help me?

Thank you

sequence genome • 5.8k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Thank you, I think that resolve my problem.

ADD REPLY
0
Entering edit mode
Thank you for your quick answers
ADD REPLY
1
Entering edit mode
9.1 years ago
mxs ▴ 530

Well, this should not be a problematic task if I understood the question correctly. All you need to do is slide along your input fasta file and count the swithches N-> something and something to ->N

perl -ne 'chomp;if( />(.*)/){$head = $1; $i=0; next};@a=split("",$_); foreach(@a){$i++;if($_ eq "N" && $s ==0 ){print "$head\t$i"; $s =1}elsif($s==1 && $_ ne "N"){print "\t$i\n";$s=0}}' infile.fa

This should do the trick.

cheers

mxs

ADD COMMENT
1
Entering edit mode

After 4.1 years...

Hi @mxs, thanks for the script. For some reason, I was getting an error in the result. I modified your script slightly. Below are the steps:

$ cat test2.txt
>NZ_CP025963.1
ATGCAGTNNNACGTGCATGACTGTACGTANGTACGTGACTGACTGACTGACTNACTGACTGATCGTACNTACGTAC

$ perl -ne 'chomp;if( />(.*)/){$head = $1; $i=0; next};@a=split("",$_); foreach(@a){$i++;if($_ eq "N" && $s ==0 ){print "$head\t$i"; $s =1}elsif($s==1 && $_ ne "N"){print "\t$i\n";$s=0}}' test2.txt
NZ_CP025963.1   8   11
NZ_CP025963.1   30  31
NZ_CP025963.1   53  54
NZ_CP025963.1   69  70

$ cat >test2.bed
NZ_CP025963.1   8   11
NZ_CP025963.1   30  31
NZ_CP025963.1   53  54
NZ_CP025963.1   69  70

$ bedtools getfasta -fi test2.txt -bed test2.bed
index file test2.txt.fai not found, generating...
>NZ_CP025963.1:8-11
NNA
>NZ_CP025963.1:30-31
G
>NZ_CP025963.1:53-54
A
>NZ_CP025963.1:69-70
T

Instead of giving me the regions of N's, oneliner have extracted other regions also. Or may be it is because bedtools is zero based system.

Modified version:

$ perl -ne 'chomp;if( />(.*)/){$head = $1; $i=0; next};@a=split("",$_); foreach(@a){$i++; if($_ eq "N" && $s ==0 ){$z=$i-1; print "$head\t$z"; $s =1}elsif($s==1 && $_ ne "N"){$j=$i-1;print "\t$j\n";$s=0}}' test2.txt
NZ_CP025963.1   7   10 ("7" here is zero based coordinate and "10" is 1 based coordinate)
NZ_CP025963.1   29  30
NZ_CP025963.1   52  53
NZ_CP025963.1   68  69

$ bedtools getfasta -fi test2.txt -bed test2.bed
>NZ_CP025963.1:7-10
NNN
>NZ_CP025963.1:29-30
N
>NZ_CP025963.1:52-53
N
>NZ_CP025963.1:68-69
N

More info on zerobased and one-based system is here.

ADD REPLY
3
Entering edit mode

Though this is a nice oneliner it is very slow on large files.

Although I did not search for a particular fast way to do this it took that much time and such a huge amount of memory(!) that I decided to quickly write it in python. It does use Biopython though.

#!/usr/bin/python3.6
import sys

#import the SeqIO module from Biopython
from Bio import SeqIO
with open(sys.argv[1], mode="r") as fasta_handle:
for record in SeqIO.parse(fasta_handle, "fasta"):
    start_pos=0
    counter=0
    gap=False
    gap_length = 0
    for char in record.seq:
        if char == 'N':
            if gap_length == 0:
                start_pos=counter
                gap_length = 1
                gap = True
            else:
                gap_length += 1
        else:
            if gap:
                printrecord.id + "\t" + str(start_pos) + "\t" + str(start_pos + gap_length))
                gap_length = 0
                gap = False
        counter += 1
ADD REPLY
0
Entering edit mode

Thanks! this worked awesome!

I had to reindent it a little but thats it:

#!/usr/bin/env  python3

import sys

# import the SeqIO module from Biopython
from Bio import SeqIO

with open(sys.argv[1], mode="r") as fasta_handle:
    for record in SeqIO.parse(fasta_handle, "fasta"):
        start_pos = 0
        counter = 0
        gap = False
        gap_length = 0
    for char in record.seq:
        if char == "N":
            if gap_length == 0:
                start_pos = counter
                gap_length = 1
                gap = True
            else:
                gap_length += 1
        else:
            if gap:
                print(
                    record.id
                    + "\t"
                    + str(start_pos)
                    + "\t"
                    + str(start_pos + gap_length)
                )
                gap_length = 0
                gap = False
        counter += 1
ADD REPLY

Login before adding your answer.

Traffic: 1895 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6