how to find a region of sequence based on region
2
0
Entering edit mode
5.3 years ago
Learner ▴ 280

I am trying to find a region in a sequence . lets say I have this sequence

>sp|P14306|CPYI_YEAST Carboxypeptidase Y inhibitor OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TFS1 PE=1 SV=2
MNQAIDFAQASIDSYKKHGILEDVIHDTSFQPSGILAVEYSSSAPVAMGNTLPTEKARSK
PQFQFTFNKQMQKSVPQANAYVPQDDDLFTLVMTDPDAPSKTDHKWSEFCHLVECDLKLL
NEATHETSGATEFFASEFNTKGSNTLIEYMGPAPPKGSGPHRYVFLLYKQPKGVDSSKFS
KIKDRPNWGYGTPATGVGKWAKENNLQLVASNFFYAETK

I want to find the amino acid number of NTLIEYMGPAPPKGSG and make this region lower case

Or remove the amino acid of this 144 to 159 from the sequence

I made a better example

lets call this one df1.txt

>tr|A0A1B1L9R9|A0A1B1L9R9_BACTU ABC transporter permease OS=Bacillus thuringiensis OX=1428 GN=berB PE=4 SV=1
MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVG
MESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIA
ITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFF
FSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIW
PVSIGLIAGALCIFIVAIVVFKKRDLPL
>sp|O15304|SIVA_HUMAN Apoptosis regulatory protein Siva OS=Homo sapiens OX=9606 GN=SIVA1 PE=1 SV=2
MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDE
GCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRA
VDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET

I have a second text file like below lets call it df2

>sp|O15304|SIVA_HUMAN (87-93) Best selected with ratio of 10 , 12 and 14
IGPDGR
>tr|A0A1B1L9R9|A0A1B1L9R9_BACTU (135-168) Not selected with ratio of (12,14)
NKELFLKINIVGGLIFLVVSAYSFF
FSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWM

I want to merge them both in one file. So few keys are into merging them The output, I am trying to make is like below

>tr|A0A1B1L9R9|A0A1B1L9R9_BACTU ABC transporter permease OS=Bacillus thuringiensis OX=1428 GN=berB PE=4 SV=1 (135-168) Not selected with ratio of (12,14)
MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVG
MESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIA
ITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLnkelflkinivgggliflvvsaysf
ffscicnderkalsysasltilffvldmvgklewmKNLSLFTLFRPKEIAEGAYNIW
PVSIGLIAGALCIFIVAIVVFKKRDLPL
>sp|O15304|SIVA_HUMAN Apoptosis regulatory protein Siva OS=Homo sapiens OX=9606 GN=SIVA1 PE=1 SV=2 (87-93) Best selected with ratio of 10 , 12 and 14
MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDE
GCAVVHLPESPKPGPTGAPRAARGQMLigpdgrLIRSLGQASEADPSGVASIACSSCVRA
VDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET

So, first we look at the df1.txt, we find a name like |A0A1B1L9R9| we match the data in the df2.txt and it has the same name. The we find the region that is specific in df2 in parentheses like (87-93) and make it lower case

protein gene • 1.8k views
ADD COMMENT
0
Entering edit mode

@Alex Reynolds I do appreciate your help very much. I am just trying to figure out how to solve it. it has taken so much time from me. I really did not mean to offend you by any mean. You Rock. I appreciate your help for every single word you say. I wish I was able to communicate with you through email. Do you think it is possible?

ADD REPLY
1
Entering edit mode

I followed your post from the start. First, you didn't describe your problem properly, and at each solution Alex Reynolds provided, you added more requirements. This is terrible, because it wastes both your and his time. Be thoughtful when asking questions, lay out the problem in detail from the start, and provide an example of the input data and intended output. A great resource is Tutorial: How To Ask Good Questions On Technical And Scientific Forums.

@Alex Reynolds I do appreciate your help very much.

If you do, upvote his answers / comments, when useful. It is the best way to show appreciation. And his answer and comments were useful - had you asked a good question from the beginning, I bet he would solve your problem right away.

ADD REPLY
1
Entering edit mode

You should use shenwei's answer. I've never been a huge fan of frameworks to handle text files, but his kit is good stuff.

ADD REPLY
0
Entering edit mode

@Alex Reynolds ok thanks

ADD REPLY
0
Entering edit mode

You're very welcome!

ADD REPLY
2
Entering edit mode
5.3 years ago

One solution using SeqKit and shell commands join and awk.

  1. Converting fasta files to tab-delimited format. Only FASTA ID instead of whole header is kept. Sorting is needed if you want join using shell command join later.

    $ cat df1.fa | seqkit sort | seqkit fx2tab -i > df1.fa.tsv
    $ cat df2.fa | seqkit sort | seqkit fx2tab -i > df2.fa.tsv
    
    $ cat df1.fa.tsv
    sp|O15304|SIVA_HUMAN    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET
    tr|A0A1B1L9R9|A0A1B1L9R9_BACTU  MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL
    
  2. Joining two TSV files. Using csvtk if IDs of records in the two files are not equal.

    # or:
    #   csvtk join -H -t df1.fa.tsv df2.fa.tsv 
    
    $ join df1.fa.tsv df2.fa.tsv        
    sp|O15304|SIVA_HUMAN    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET         IGPDGR
    tr|A0A1B1L9R9|A0A1B1L9R9_BACTU  MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL            NKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWM
    
  3. Converting subsequence (3rd column) found in sequence (2nd column) to lowercase, and removing 3rd column.

    $ join df1.fa.tsv df2.fa.tsv \
        | awk '{ gsub($3, tolower($3)); print $1"\t"$2}'
    
    sp|O15304|SIVA_HUMAN    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLigpdgrLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET
    tr|A0A1B1L9R9|A0A1B1L9R9_BACTU  MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLnkelflkinivggliflvvsaysfffscicnderkalsysasltilffvldmvgklsdklewmKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL
    
  4. Converting TSV back to FASTA format.

    $ join df1.fa.tsv df2.fa.tsv \
        | awk '{ gsub($3, tolower($3)); print $1"\t"$2}' \
        | seqkit tab2fx \
        > df1.mask.fa
    
    $ cat df1.mask.fa
    >sp|O15304|SIVA_HUMAN
    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDE
    GCAVVHLPESPKPGPTGAPRAARGQMLigpdgrLIRSLGQASEADPSGVASIACSSCVRA
    VDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET
    >tr|A0A1B1L9R9|A0A1B1L9R9_BACTU
    MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVG
    MESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIA
    ITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLnkelflkinivggliflvvsaysff
    fscicnderkalsysasltilffvldmvgklsdklewmKNLSLFTLFRPKEIAEGAYNIW
    PVSIGLIAGALCIFIVAIVVFKKRDLPL
    
  5. (Optional) If you want keep the full header.

    $ seqkit seq -n df1.fa | sed 's/ /\t/' > id2head.tsv
    
    $ cat df1.mask.fa | seqkit replace -k id2head.tsv -p '(.+)' -r '$1 {kv}' 
    >sp|O15304|SIVA_HUMAN Apoptosis regulatory protein Siva OS=Homo sapiens OX=9606 GN=SIVA1 PE=1 SV=2
    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDE
    GCAVVHLPESPKPGPTGAPRAARGQMLigpdgrLIRSLGQASEADPSGVASIACSSCVRA
    VDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET
    >tr|A0A1B1L9R9|A0A1B1L9R9_BACTU ABC transporter permease OS=Bacillus thuringiensis OX=1428 GN=berB PE=4 SV=1
    MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVG
    MESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIA
    ITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLnkelflkinivggliflvvsaysff
    fscicnderkalsysasltilffvldmvgklsdklewmKNLSLFTLFRPKEIAEGAYNIW
    PVSIGLIAGALCIFIVAIVVFKKRDLPL
    
ADD COMMENT
0
Entering edit mode

@shenwei356 when I use this

join df1.fa.tsv df2.fa.tsv | awk '{gsub($3, tolower($3)); print $1"\t"$2}'

it does not work, it seems like there must be both data having the same number of accession , other wise it won't work

ADD REPLY
0
Entering edit mode

Step 2. Joining two TSV files. Using csvtk if IDs of records in the two files are not equal

ADD REPLY
0
Entering edit mode

@shenwei356 I am getting this error [INFO] read key-value file: id2head.tsv [ERRO] no valid data in key-value file: id2head.tsv

ADD REPLY
0
Entering edit mode

checking is id2head.tsv blank.

if it is, paste result of

seqkit seq -n df1.fa | head -n 5 | cat -A

if not, paste result of

head -n 5 id2head.tsv | cat -A
ADD REPLY

Login before adding your answer.

Traffic: 1991 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6