How to select names with wildcard from AAStringSet (biostrings)
2
0
Entering edit mode
4.9 years ago
Benn 8.3k

Maybe a simple question, hopefully I get a simple answer.

I have imported a fasta file of proteins in R with biostrings.

> mySeqs <- readAAStringSet(fastaFile)   # from package Biostrings
> mySeqs
  A AAStringSet instance of length 318
      width seq                                             names               
  [1]    98 QVQLVQSGAEVKKPGASVKVSC...STAYMELRSLRSDDTAVYYCAR IGHV1-18*01
  [2]    92 QVQLVQSGAEVKKPGASVKVSC...TTDTSTSTAYMELRSLRSDDTA IGHV1-18*02
  [3]    98 QVQLVQSGAEVKKPGASVKVSC...STAYMELRSLRSDDMAVYYCAR IGHV1-18*03
  [4]    98 QVQLVQSGAEVKKPGASVKVSC...STAYMELRSLRSDDTAVYYCAR IGHV1-18*04
  [5]    98 QVQLVQSGAEVKKPGASVKVSC...STAYMELSRLRSDDTVVYYCAR IGHV1-2*01
  ...   ... ...
[314]    91 QVQLVQSGSELKKPGASVKVSC...FSLDTSVSTAYLQISTLKAEDT IGHV7-4-1*03
[315]    98 QVQLVQSGSELKKPGASVKVSC...SMAYLQISSLKAEDTAVYYCAR IGHV7-4-1*04
[316]    98 QVQLVQSGSELKKPGASVKVSC...SMAYLQISSLKAEDTAVCYCAR IGHV7-4-1*05
[317]    98 QVQLVQSGHEVKQPGASVKVSC...STAYLQISSLKAEDMAMYYCAR IGHV7-81*01
[318]    98 EAQLTESGGDLVHLEGPLRLSC...YMLYMQMISLRTQNMAAFNCAG IGHV8-51-1*02

So how can I select only sequences with a certain name, lets say all names that contain "IGHV1". I have tried some grep, but that does not work on these objects.

Thanks in advance.

biostrings R • 2.8k views
ADD COMMENT
4
Entering edit mode
4.9 years ago
AK ★ 2.2k

Try:

mySeqs[grepl("IGHV1", mySeqs@ranges@NAMES)]

Example:

> mySeqs
  A AAStringSet instance of length 10
     width seq                                                                                names               
 [1]    44 QVQLVQSGAEVKKPGASVKVSCSTAYMELRSLRSDDTAVYYCAR                                       IGHV1-18*01
 [2]    44 QVQLVQSGAEVKKPGASVKVSCTTDTSTSTAYMELRSLRSDDTA                                       IGHV1-18*02
 [3]    44 QVQLVQSGAEVKKPGASVKVSCSTAYMELRSLRSDDMAVYYCAR                                       IGHV1-18*03
 [4]    44 QVQLVQSGAEVKKPGASVKVSCSTAYMELRSLRSDDTAVYYCAR                                       IGHV1-18*04
 [5]    44 QVQLVQSGAEVKKPGASVKVSCSTAYMELSRLRSDDTVVYYCAR                                       IGHV1-2*01
 [6]    44 QVQLVQSGSELKKPGASVKVSCFSLDTSVSTAYLQISTLKAEDT                                       IGHV7-4-1*03
 [7]    44 QVQLVQSGSELKKPGASVKVSCSMAYLQISSLKAEDTAVYYCAR                                       IGHV7-4-1*04
 [8]    44 QVQLVQSGSELKKPGASVKVSCSMAYLQISSLKAEDTAVCYCAR                                       IGHV7-4-1*05
 [9]    44 QVQLVQSGHEVKQPGASVKVSCSTAYLQISSLKAEDMAMYYCAR                                       IGHV7-81*01
[10]    44 EAQLTESGGDLVHLEGPLRLSCYMLYMQMISLRTQNMAAFNCAG                                       IGHV8-51-1*02
> mySeqs[grepl("IGHV1", mySeqs@ranges@NAMES)]
  A AAStringSet instance of length 5
    width seq                                                                                 names               
[1]    44 QVQLVQSGAEVKKPGASVKVSCSTAYMELRSLRSDDTAVYYCAR                                        IGHV1-18*01
[2]    44 QVQLVQSGAEVKKPGASVKVSCTTDTSTSTAYMELRSLRSDDTA                                        IGHV1-18*02
[3]    44 QVQLVQSGAEVKKPGASVKVSCSTAYMELRSLRSDDMAVYYCAR                                        IGHV1-18*03
[4]    44 QVQLVQSGAEVKKPGASVKVSCSTAYMELRSLRSDDTAVYYCAR                                        IGHV1-18*04
[5]    44 QVQLVQSGAEVKKPGASVKVSCSTAYMELSRLRSDDTVVYYCAR                                        IGHV1-2*01
> 
ADD COMMENT
0
Entering edit mode

Simpler than mine, +1. I always forget than Biostrings follows the same logic as IRanges/GenomicRanges when it comes to subsetting. You almost never need any apply based method :)

ADD REPLY
0
Entering edit mode

Great solution, thanks!

ADD REPLY
1
Entering edit mode
4.9 years ago
ATpoint 82k

With a dummy dataset:

require(Biostrings)

example.dna <- DNAStringSet(c(`IGHV1-18*01`="ACGGTTAGT", 
                              `IGHV1-18*02`="GTCGGGATA", 
                              `IGHV2-18*03`="GGGTCTAAA"))

example.aa <- translate(example.dna)

## Make a named list of AAStrings, then combine to AAStringSet:
out.aa <- AAStringSet(x = sapply(grep("IGHV1", names(example.aa)), 
                                 function(x) {
                                   tmp <- list(example.aa[[x]])
                                   names(tmp) <- names(example.aa)[x]
                                   return(tmp)
                                  })
)
ADD COMMENT
0
Entering edit mode

I'll go for the one-liner, but thanks anyways!

ADD REPLY

Login before adding your answer.

Traffic: 2846 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6