Question

Extracting N positions from fasta file

2

Entering edit mode

7.4 years ago

samuel.lipworth ▴ 30

Hi I am very new to this so appologies if this is a simple question - I have been trying to figure this out for days to no avail and my python skills are not quite there yet!

I am trying to extract all N positions from novel bacterial sequences which have been aligned to a member of the same genus. I would like the start and end positions of all N motifs eg. GTCAGNNNNNTGGT

Is there an existing tool / how could I go about creating this in python?

Many thanks.

sequence • 4.7k views

ADD COMMENT • link updated 2.2 years ago by Pierre Lindenbaum 161k • written 7.4 years ago by samuel.lipworth ▴ 30

0

Entering edit mode

try regular expression: https://docs.python.org/2/library/re.html

ADD REPLY • link 7.4 years ago by ahmedakhokhar ▴ 150

0

Entering edit mode

Do you want to search for a specific pattern or for every location in which an 'N' is present?

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

every location at which a N is present. I have something like 36 different strains and would like to produce a list of N locations in each FASTA file and then compare these lists to find the unique N locations for each strain.

ADD REPLY • link 7.4 years ago by samuel.lipworth ▴ 30

0

Entering edit mode

So the output would be the chromosomal locations, right? Sounds like a job that can be done using Biopython. What have you tried?

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

If the sequence is not long you can do it without software, open the file fasta by the wordpad

ADD REPLY • link 7.4 years ago by nora ▴ 40

0

Entering edit mode

That's not very helpful.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

thanks but the sequences are > 4 million bases

ADD REPLY • link 7.4 years ago by samuel.lipworth ▴ 30

2

Entering edit mode

Just use SeqKit. shenwei356 has even provided a detailed example below.

ADD REPLY • link 7.4 years ago by GenoMax 141k

score 7 · Answer 1 · 2016-11-28

Use SeqKit. SeqKit supports Windows/Linux/Mac OS X.

$ echo -en '>seq\nGTCAGNNNNNTGGT\n' | seqkit locate --ignore-case --only-positive-strand --pattern "N+" 
seqID   patternName     pattern strand  start   end     matched
seq     N+      N+      +       6       10      NNNNN

aligned result

$ echo -en '>seq\nGTCAGNNNNNTGGT\n' | seqkit locate --ignore-case --only-positive-strand --pattern "N+" | column -t
seqID  patternName  pattern  strand  start  end  matched
seq    N+           N+       +       6      10   NNNNN

or from file and save to file

$ seqkit locate --ignore-case --only-positive-strand --pattern "N+" seqs.fa > result.xls

score 0 · Answer 2 · 2022-02-08

picard ScatterIntervalsByNs https://broadinstitute.github.io/picard/command-line-overview.html#ScatterIntervalsByNs

Writes an interval list based on splitting a reference by Ns. This tool identifies positions in a reference where the bases are 'no-calls' and writes out an interval-list using the resulting coordinates. This can be used to create an interval list for whole genome sequence (WGS) for e.g. scatter-gather purposes, as an alternative to using fixed-length intervals. The number of contiguous nocalls that can be tolerated before creating a break is adjustable from the command line.