Fasta Length
10.6 years ago
k.nirmalraman ★ 1.1k

I was wondering whether there are any pre-built packages or scripts in R or Python that can extract sequences longer than, shorter than, or exactly N bases from a multi-FASTA file.

fasta length • 7.9k views
10.6 years ago

I think this post should help you.

How to Filter Multi fasta by length??

10.6 years ago
brentp 24k

https://github.com/lh3/bioawk

bioawk -c fastx '(length($seq) > 60){ print ">"$name"\n"$seq }' input.fasta > output.fasta
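Since the question also asks about Python, here is a minimal pure-Python sketch of the same filter, with no Biopython dependency. The function names read_fasta and filter_longer_than are made up for illustration, and the parser assumes a plain, well-formed FASTA file. Like bioawk's $name, it keeps only the part of the header before the first whitespace.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # keep only the identifier, dropping any description after it
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        yield name, "".join(chunks)


def filter_longer_than(in_handle, out_handle, min_length=60):
    """Write records with len(seq) > min_length; return how many were kept."""
    kept = 0
    for name, seq in read_fasta(in_handle):
        if len(seq) > min_length:
            out_handle.write(f">{name}\n{seq}\n")
            kept += 1
    return kept


# Small in-memory demo; with real files, pass open("input.fasta") and
# open("output.fasta", "w") instead of the StringIO objects.
demo = io.StringIO(">a some description\nACGT\nACGT\n>b\nAC\n")
out = io.StringIO()
kept = filter_longer_than(demo, out, min_length=4)
print(out.getvalue(), end="")
```

Each kept sequence is written on a single line, as in the bioawk output, so any original line wrapping is not preserved.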
10.6 years ago
Neilfws 49k

One option for R is the seqinr package.

Use read.fasta() to read in a multi-sequence fasta file:

library(seqinr)
f <- read.fasta("myfile.fa")  # default is DNA; add seqtype = "AA" for protein

Use summary() for a range of sequence statistics, including length:

summary(f[[1]])  # first sequence
summary(f[[1]])$length  # length of first sequence

Use sapply() to keep sequences longer than N bases:

f1 <- f[which(sapply(f, function(x) summary(x)$length) > 60)]  # here N = 60
# ignore warnings; they relate to sequence composition
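The question also asks for sequences shorter than or exactly N bases. As a rough sketch of how the three cases can share one function, here is the idea in Python, assuming the sequences are already loaded into a dict of name to sequence string (comparable to the list f above). The names COMPARATORS and select_by_length are hypothetical, not part of any package.

```python
import operator

# Map each mode to the comparison applied as compare(len(seq), n)
COMPARATORS = {
    "longer": operator.gt,
    "shorter": operator.lt,
    "exactly": operator.eq,
}


def select_by_length(seqs, n, mode="longer"):
    """Return the entries of seqs whose length is longer/shorter/exactly n."""
    compare = COMPARATORS[mode]
    return {name: seq for name, seq in seqs.items() if compare(len(seq), n)}


# Toy data standing in for a parsed multi-FASTA file
seqs = {"s1": "ACGT", "s2": "ACGTACGT", "s3": "AC"}

print(select_by_length(seqs, 4, "longer"))
print(select_by_length(seqs, 4, "shorter"))
print(select_by_length(seqs, 4, "exactly"))
```

The same switch works in the R version by swapping > for < or == in the comparison.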

There is also sure to be an R/Bioconductor solution using Biostrings.


A shorter version using the dedicated getLength() function:

f1 <- f[which(getLength(f) > 60)]


Nice; I always forget the get*() functions in seqinr.
