Fast way to extract hg19 sequences with Biopython?

0

7.8 years ago

nchuang ▴ 260

I am trying to extract a list of 5 kb sequences from the hg19 genome. However, it takes a very long time to read the whole genome into memory with Bio.SeqIO.parse(), and even Bio.SeqIO.index() takes a long time.

Is there a fast way to do this, or is this a limitation of Python?

I'm waiting for my admin to install pyfaidx for me, and I will see how that performs too.

python biopython • 2.5k views

ADD COMMENT • link 7.8 years ago by nchuang ▴ 260

1

You could use samtools faidx in the meantime; see the thread "Getting Sequence Based On Chromosome No And Coordinates From Whole Genome Fasta File".

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Wow, that looks so simple. I will try calling it through subprocess. Thanks!

ADD REPLY • link 7.8 years ago by nchuang ▴ 260
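
For anyone taking the subprocess route, here is a minimal sketch of how the call might look. It assumes samtools is on the PATH and that the FASTA already has a .fai index next to it (running `samtools faidx hg19.fa` with no region creates one). The helper names `faidx_command`, `parse_fasta` and `fetch_regions` are made up for illustration, and region strings use the samtools `chr:start-end` form (1-based, inclusive).

```python
import subprocess


def faidx_command(fasta_path, regions):
    """Build the argument list for a single samtools faidx call."""
    return ["samtools", "faidx", fasta_path] + list(regions)


def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            # Header line: start a new record keyed by the text after '>'.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence line: collect it under the current header.
            records[name].append(line.strip())
    return {key: "".join(parts) for key, parts in records.items()}


def fetch_regions(fasta_path, regions):
    """Run samtools faidx once for all regions and parse its output.

    Assumes the samtools executable is on PATH (not checked here).
    """
    output = subprocess.check_output(
        faidx_command(fasta_path, regions), universal_newlines=True
    )
    return parse_fasta(output)
```

Passing all regions in one call, for example `fetch_regions("hg19.fa", ["chr1:10001-15000", "chr2:20001-25000"])`, avoids starting a new process per region. The file name and coordinates here are placeholders.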

1

Try pip install --user pyfaidx. Then you should have the faidx script in $HOME/.local/bin.

ADD REPLY • link 7.8 years ago by Matt Shirley 10k

0

I don't have root, and pip is not even installed for the default Python 2.4. They did install Python 3.4 for me, but I had to add it to my PATH. I don't think I have pip for that install either.

ADD REPLY • link 7.8 years ago by nchuang ▴ 260
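
When nothing can be installed, the random-access idea behind faidx can also be sketched in plain Python. This is only an illustration, not how samtools or pyfaidx are actually implemented. It assumes a .fai index already exists (tab-separated columns: name, length, byte offset of the first base, bases per line, bytes per line) and that every sequence line except the last has the same length, which the faidx format requires. The function names `load_fai` and `fetch` are hypothetical.

```python
def load_fai(fai_path):
    """Read a .fai index into {name: (length, offset, linebases, linewidth)}."""
    index = {}
    with open(fai_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            index[fields[0]] = tuple(int(x) for x in fields[1:5])
    return index


def fetch(fasta_handle, index, name, start, end):
    """Return bases start..end (1-based, inclusive) of sequence `name`.

    fasta_handle must be opened in binary mode ('rb') so that seek()
    works on the byte offsets stored in the index.
    """
    length, offset, linebases, linewidth = index[name]
    # Clip the request to the sequence bounds.
    start = max(start, 1)
    end = min(end, length)
    if start > end:
        return ""
    first = start - 1  # 0-based position of the first base wanted
    last = end - 1     # 0-based position of the last base wanted
    # Each full line of `linebases` bases occupies `linewidth` bytes on disk.
    byte_start = offset + (first // linebases) * linewidth + first % linebases
    byte_end = offset + (last // linebases) * linewidth + last % linebases
    fasta_handle.seek(byte_start)
    chunk = fasta_handle.read(byte_end - byte_start + 1)
    # Drop the line breaks that fall inside the requested span.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")
```

With the index loaded once via `load_fai("hg19.fa.fai")` and the FASTA opened once with `open("hg19.fa", "rb")`, each 5 kb window costs one seek and one small read, so the genome never has to be loaded into memory. The file names are placeholders.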
