Hard Masked + Raw Fasta To Soft Masked Fasta
2
1
Entering edit mode
12.1 years ago
Darked89 4.6k

Hello,

I downloaded from ENSEMBL either unmasked genomic sequences or hard masked (Ns for repeats) genomic sequences. There is no soft masked sequence in sight on ENSEMBL ftp server.

Flipping upper cases to lower cases for each position is quite straightforward (see below), but it takes long time for mammalian genome. Simple question: how to do it faster, be it in Python or any other language?

#!/usr/bin/env python

from pyfasta import Fasta

masked_fasta = Fasta('test10k.rm.fa')
unmask_fasta = Fasta('test10k.fa')


for seqid in unmask_fasta.keys():
  print ">" + seqid

  unmasked_seq = unmask_fasta[seqid]
  masked_seq   = masked_fasta[seqid]

  output_seq = ""
  for position in range(0, len(unmasked_seq)):
    if masked_seq[position] == "N":
      base =  unmasked_seq[position].lower()
    else:
      base  =  unmasked_seq[position]
    output_seq += base
  print output_seq
python code • 4.3k views
ADD COMMENT
1
Entering edit mode
12.1 years ago
brentp 24k

pyfasta lets you turn the sequence into a numpy array so you can avoid loops. I think this is close to what you want to do:

from pyfasta import Fasta
import numpy as np
import sys

fa = Fasta(sys.argv[1]) # sequence with N's
fb = Fasta(sys.argv[2]) # sequence with replacements

for seqid, aseq in fa.iteritems():

    aseq = np.array(aseq)
    bseq = np.array(str(fb[seqid]).lower(), dtype="c")

    cseq = np.where(aseq == "N", bseq, aseq)

    print ">%s\n%s" % (seqid, cseq.tostring())
ADD COMMENT
0
Entering edit mode

This really speed things up. I went from ca 4hrs to 40mins. Thank you!

ADD REPLY
1
Entering edit mode
12.1 years ago
Gustavo ▴ 530

Slightly sidestepping your question, and assuming your genome of interest is among those provided...

--> http://repeatmasker.org/PreMaskedGenomes.html

ADD COMMENT
0
Entering edit mode

Thanks, indeed it is there, but it is an older version.

ADD REPLY

Login before adding your answer.

Traffic: 2018 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6