Question

How to extract all non-seqenced positions from a genome (Fasta file)?

1

Entering edit mode

9.1 years ago

TheBabyBlueEyes ▴ 10

Hello,

Sorry in advance for my English.

I want extract all positions of non-sequenced nucleotides for each chromosomes from a genome sequence file (Fasta).

To do that, I have a fasta file with the genome sequences, like this:

>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNtaaattgttt
taaattgtttctgtttgcagttgacatgatcttatatatagaaaacacca
ataactctgccaaaaaatttagaattcataaatgaatttagtaaagttgc

I want find all N positions and obtain this position in bed format, e.g.:

chr1 1 200 ...

I didn't found how to do that in Bedtools.

If you have any idea, could you help me?

Thank you

sequence genome • 5.8k views

ADD COMMENT • link updated 23 months ago by Ram 43k • written 9.1 years ago by TheBabyBlueEyes ▴ 10

0

Entering edit mode

may be you're looking for http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz ?

ADD REPLY • link 9.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you, I think that resolve my problem.

ADD REPLY • link 9.1 years ago by TheBabyBlueEyes ▴ 10

0

Entering edit mode

Thank you for your quick answers

ADD REPLY • link 9.1 years ago by TheBabyBlueEyes ▴ 10

score 1 · Accepted Answer · 2015-03-09

1

Entering edit mode

9.1 years ago

mxs ▴ 530

Well, this should not be a problematic task if I understood the question correctly. All you need to do is slide along your input fasta file and count the swithches N-> something and something to ->N

perl -ne 'chomp;if( />(.*)/){$head = $1; $i=0; next};@a=split("",$_); foreach(@a){$i++;if($_ eq "N" && $s ==0 ){print "$head\t$i"; $s =1}elsif($s==1 && $_ ne "N"){print "\t$i\n";$s=0}}' infile.fa

This should do the trick.

cheers

mxs

ADD COMMENT • link 9.1 years ago by mxs ▴ 530

1

Entering edit mode

After 4.1 years...

Hi @mxs, thanks for the script. For some reason, I was getting an error in the result. I modified your script slightly. Below are the steps:

$ cat test2.txt
>NZ_CP025963.1
ATGCAGTNNNACGTGCATGACTGTACGTANGTACGTGACTGACTGACTGACTNACTGACTGATCGTACNTACGTAC

$ perl -ne 'chomp;if( />(.*)/){$head = $1; $i=0; next};@a=split("",$_); foreach(@a){$i++;if($_ eq "N" && $s ==0 ){print "$head\t$i"; $s =1}elsif($s==1 && $_ ne "N"){print "\t$i\n";$s=0}}' test2.txt
NZ_CP025963.1   8   11
NZ_CP025963.1   30  31
NZ_CP025963.1   53  54
NZ_CP025963.1   69  70

$ cat >test2.bed
NZ_CP025963.1   8   11
NZ_CP025963.1   30  31
NZ_CP025963.1   53  54
NZ_CP025963.1   69  70

$ bedtools getfasta -fi test2.txt -bed test2.bed
index file test2.txt.fai not found, generating...
>NZ_CP025963.1:8-11
NNA
>NZ_CP025963.1:30-31
G
>NZ_CP025963.1:53-54
A
>NZ_CP025963.1:69-70
T

Instead of giving me the regions of N's, oneliner have extracted other regions also. Or may be it is because bedtools is zero based system.

Modified version:

$ perl -ne 'chomp;if( />(.*)/){$head = $1; $i=0; next};@a=split("",$_); foreach(@a){$i++; if($_ eq "N" && $s ==0 ){$z=$i-1; print "$head\t$z"; $s =1}elsif($s==1 && $_ ne "N"){$j=$i-1;print "\t$j\n";$s=0}}' test2.txt
NZ_CP025963.1   7   10 ("7" here is zero based coordinate and "10" is 1 based coordinate)
NZ_CP025963.1   29  30
NZ_CP025963.1   52  53
NZ_CP025963.1   68  69

$ bedtools getfasta -fi test2.txt -bed test2.bed
>NZ_CP025963.1:7-10
NNN
>NZ_CP025963.1:29-30
N
>NZ_CP025963.1:52-53
N
>NZ_CP025963.1:68-69
N

More info on zerobased and one-based system is here.

ADD REPLY • link 5.0 years ago by Prakki Rama ★ 2.7k

3

Entering edit mode

Though this is a nice oneliner it is very slow on large files.

Although I did not search for a particular fast way to do this it took that much time and such a huge amount of memory(!) that I decided to quickly write it in python. It does use Biopython though.

#!/usr/bin/python3.6
import sys

#import the SeqIO module from Biopython
from Bio import SeqIO
with open(sys.argv[1], mode="r") as fasta_handle:
for record in SeqIO.parse(fasta_handle, "fasta"):
    start_pos=0
    counter=0
    gap=False
    gap_length = 0
    for char in record.seq:
        if char == 'N':
            if gap_length == 0:
                start_pos=counter
                gap_length = 1
                gap = True
            else:
                gap_length += 1
        else:
            if gap:
                printrecord.id + "\t" + str(start_pos) + "\t" + str(start_pos + gap_length))
                gap_length = 0
                gap = False
        counter += 1

ADD REPLY • link 4.9 years ago by ronaldnieuwenhuis ▴ 30

0

Entering edit mode

Thanks! this worked awesome!

I had to reindent it a little but thats it:

#!/usr/bin/env  python3

import sys

# import the SeqIO module from Biopython
from Bio import SeqIO

with open(sys.argv[1], mode="r") as fasta_handle:
    for record in SeqIO.parse(fasta_handle, "fasta"):
        start_pos = 0
        counter = 0
        gap = False
        gap_length = 0
    for char in record.seq:
        if char == "N":
            if gap_length == 0:
                start_pos = counter
                gap_length = 1
                gap = True
            else:
                gap_length += 1
        else:
            if gap:
                print(
                    record.id
                    + "\t"
                    + str(start_pos)
                    + "\t"
                    + str(start_pos + gap_length)
                )
                gap_length = 0
                gap = False
        counter += 1

ADD REPLY • link 3.2 years ago by glich • 0