Question

Fasta Length


10.6 years ago

k.nirmalraman ★ 1.1k

I was wondering whether there are any prebuilt R or Python packages or scripts that can extract, from a multi-FASTA file, the sequences that are longer than, shorter than, or exactly N bases.

fasta length • 7.9k views



I believe Biopieces can do that as well: https://code.google.com/p/biopieces/wiki/HowTo#Howto_filter_sequences_on_length

10.6 years ago by Biomonika (Noolean) 3.2k

score 5 · Answer 1 · 2013-09-16


10.6 years ago

Ashutosh Pandey 12k

I think this post should help you.

How to Filter Multi fasta by length??


score 3 · Answer 2 · 2013-09-16


10.6 years ago

brentp 24k

https://github.com/lh3/bioawk

bioawk -c fastx '(length($seq) > 60){ print ">"$name"\n"$seq }' input.fasta > output.fasta
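If bioawk is not available, the same filter can be sketched in plain Python with only the standard library. This is a minimal sketch, not a replacement for a proper parser: the function names `read_fasta` and `filter_longer_than` are made up for illustration, and unlike the bioawk command above (which prints only `$name`) it writes the full header line and puts each sequence on a single line.

```python
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA stream (hypothetical helper)."""
    header = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect them and join later.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_longer_than(src: TextIO, dst: TextIO, min_len: int = 60) -> int:
    """Write records with len(sequence) > min_len to dst; return how many were kept."""
    kept = 0
    for header, seq in read_fasta(src):
        if len(seq) > min_len:
            dst.write(f">{header}\n{seq}\n")
            kept += 1
    return kept


# Assumed usage, with the same file names as the bioawk example:
# with open("input.fasta") as src, open("output.fasta", "w") as dst:
#     filter_longer_than(src, dst, min_len=60)
```

Changing `>` to `<` or `==` in the comparison gives the "shorter than" and "exactly N" variants.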


score 2 · Answer 3 · 2013-09-16


10.6 years ago

Neilfws 49k

One option for R is the seqinr package.

Use read.fasta() to read in a multi-sequence fasta file:

library(seqinr)
f <- read.fasta("myfile.fa")  # default is DNA; add seqtype = "AA" for protein

Use summary() for a range of sequence statistics, including length:

summary(f[[1]])  # first sequence
summary(f[[1]])$length  # length of first sequence

Use sapply() to get sequences longer than N bases:

f1 <- f[which(sapply(f, function(x) summary(x)$length) > 60)]  # keep sequences longer than N = 60
# ignore warnings; they relate to sequence composition

There is also sure to be an R/Bioconductor solution using Biostrings.



A shorter version using the dedicated getLength() function:

f1 <- f[which(getLength(f)>60)]
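Since the question also asks for Python and for the "shorter than" and "exactly N" cases, here is a rough Python analogue of the same idea. It assumes the sequences are already in memory as a name-to-sequence dict (the `records` dict below is made-up example data standing in for the list `f`), and `select_by_length` is a hypothetical helper, not part of any package.

```python
import operator

# Made-up example data: name -> sequence, standing in for the seqinr list `f`.
records = {
    "seq1": "ACGTACGT",  # length 8
    "seq2": "ACG",       # length 3
    "seq3": "ACGTA",     # length 5
}

# Map each mode to the comparison applied as compare(len(seq), n).
COMPARE = {
    "longer": operator.gt,
    "shorter": operator.lt,
    "exact": operator.eq,
}


def select_by_length(records, n, mode="longer"):
    """Return the records whose sequence length is longer/shorter/exactly n."""
    compare = COMPARE[mode]
    return {name: seq for name, seq in records.items() if compare(len(seq), n)}


longer = select_by_length(records, 5, "longer")    # keeps seq1
shorter = select_by_length(records, 5, "shorter")  # keeps seq2
exact = select_by_length(records, 5, "exact")      # keeps seq3
```

The only thing that changes between the three cases is the comparison operator, which is why it is passed in as a lookup rather than written out three times.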

10.6 years ago by Manu Prestat 4.1k


Nice; I always forget the get*() functions in seqinR.

10.6 years ago by Neilfws 49k