Remove gap from files
6.6 years ago
skjobs1234 ▴ 40

I want to remove the long unaligned gap regions from my files, position by position. My input files look like this:

 >f1
--------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------GTTYGVC
SKAFKFLGTPADTGHGTVVLELQYTGTDGPCKVPISSVASLNDLTPVGRLVTVNPFVSVA

>f2
--------------------------------------------------------------------------------------------------------------------
-----------------------------MVHRQWFFDLPLPWA-----------------------------------------GTTYGMC
TEKFSFAKNPADTGHGTVVIELSYSGSDGPCKIPIVSVASLNDMTPVGRLVTVNPFVATS

>f3
MRCVGVGNRDFVEGLSGATWVDVVLEHGGCVTTMAKNKPTLDIELQKTEATQLATLRKLC
IEGKITNITTDSRCPTQGEAVLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQ
CLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGGG

And I want the output to look like this:

>f1
------------------------------------------------------------------------------------------------------GTTYGVC
SKAFKFLGTPADTGHGTVVLELQYTGTDGPCKVPISSVASLNDLTPVGRLVTVNPFVSVA

>f2
-----------------------------MVHRQWFFDLPLPWA-----------------------------------------GTTYGMC
TEKFSFAKNPADTGHGTVVIELSYSGSDGPCKIPIVSVASLNDMTPVGRLVTVNPFVATS

>f3
IEGKITNITTDSRCPTQGEAVLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQ
CLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGGG

So an equal number of gap positions is removed from f1 and f2, and the corresponding line is removed from f3, based on the positions that were removed from f1 and f2.

Thanks in advance

alignment sequence • 3.2k views
ADD COMMENT
5
Entering edit mode

I want remove the long unaligned gap from the files with position wise

Why? This sounds like an XY problem: http://xyproblem.info/

ADD REPLY
0
Entering edit mode

I have tried Gblocks and trimAl, but neither is suitable for my data set. Please help me write a Perl or Python script.

ADD REPLY
0
Entering edit mode

I want remove the long unaligned gap from the files with position wise

Do you know the positions of the gaps to remove? Are they based on the blast result?

ADD REPLY
0
Entering edit mode

No, please guide me. Yes, the positions are based on the BLAST result. I used PROMALS3D to produce the aligned file, and yes, the positions are known.

ADD REPLY
0
Entering edit mode

Please provide examples of the positions at which gaps should be removed.

ADD REPLY
0
Entering edit mode

Example input file

>f1
------------------------------------------------------------------------------------------------------GTTYGVC
SKAFKFLGTPADTGHGTVVLELQYTGTDGPCKVPISSVASLNDLTPVGRLVTVNPFVSVA

>f2
-----------------------------MVHRQWFFDLPLPWA-----------------------------------------GTTYGMC
TEKFSFAKNPADTGHGTVVIELSYSGSDGPCKIPIVSVASLNDMTPVGRLVTVNPFVATS

>f3
MRCVGVGNRDFVEGLSGATWVDVVLEHGGCVTTMAKNKPTLDIELQKTEATQLATLRKLC
IEGKITNITTDSRCPTQGEAVLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQ
CLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGGG

Output file example

>f1
------------------------------------------------------------------------------------------------------GTTYGVC
SKAFKFLGTPADTGHGTVVLELQYTGTDGPCKVPISSVASLNDLTPVGRLVTVNPFVSVA

>f2
-----------------------------MVHRQWFFDLPLPWA-----------------------------------------GTTYGMC
TEKFSFAKNPADTGHGTVVIELSYSGSDGPCKIPIVSVASLNDMTPVGRLVTVNPFVATS

>f3
IEGKITNITTDSRCPTQGEAVLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQ
CLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGGG

ADD REPLY
1
Entering edit mode

Why do you keep the gaps at the beginning of f1 and f2 in the output? How many gaps do you want to keep?

ADD REPLY
0
Entering edit mode

This does not look like you want to remove gaps but just remove the new line characters on the fasta header line.

ADD REPLY
0
Entering edit mode

This question is unrelated to Perl or Blast so I removed those tags.

ADD REPLY
0
Entering edit mode

Hello skjobs1234!

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/2524/r

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY
3
Entering edit mode
6.6 years ago
shoujun.gu ▴ 380
  1. save the following code in a file called biostar.py (or any other name)
  2. in shell, run: python3 biostar.py input_file output_file
  3. note that this code may not work on very large files, because it reads the whole file into memory

code:

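import sys

inp = sys.argv[1]
output = sys.argv[2]

# Pass 1: find the largest number of gap-only lines in any single record.
count = 0
tempcount = 0
with open(inp, 'r') as file:
    for line in file:
        if line.startswith('>'):
            count = max(count, tempcount)
            tempcount = 0
        # strip() so a last line without a trailing newline is still counted
        elif set(line.strip()) == {'-'}:
            tempcount += 1
count = max(count, tempcount)

# Pass 2: read every record and drop its first `count` sequence lines.
with open(inp, 'r') as file2:
    records = file2.read().split('>')[1:]

out_lines = []  # avoid shadowing the built-in name `list`
for record in records:
    fasta = record.split('\n')
    # fasta[0] is the header; fasta[1:count + 1] are the lines being dropped
    out_lines += ['>' + fasta[0]] + fasta[count + 1:]
out_lines = [text + '\n' for text in out_lines if text]

with open(output, 'w') as out:
    out.writelines(out_lines)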
ADD COMMENT
0
Entering edit mode

I'm getting this error on Terminal

Traceback (most recent call last):
  File "biostar.py", line 3, in <module>
    inp=sys.argv[1]
IndexError: list index out of range

ADD REPLY
1
Entering edit mode

Did you provide the absolute paths of the input and output files on the command line? If so, could you copy the command you ran here?
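
For context, this IndexError is what Python raises when the script is started with fewer than two arguments, so that sys.argv[1] does not exist. A minimal sketch of a guard, assuming a hypothetical helper called parse_paths that is not part of the posted script:

```python
def parse_paths(argv):
    """Return (input_path, output_path) from an argv-style list.

    parse_paths is a hypothetical helper, not part of the posted script.
    """
    if len(argv) != 3:
        # argv[0] is the script name, so two file paths make three entries.
        raise SystemExit("usage: python3 biostar.py input_file output_file")
    return argv[1], argv[2]


# In the real script this list would be sys.argv.
inp, output = parse_paths(["biostar.py", "in.fasta", "out.fasta"])
```

With a check like this, a missing argument gives a usage message instead of a traceback.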

ADD REPLY
0
Entering edit mode

Yes, it's working now. Thank you for your valuable suggestion and your time.

ADD REPLY
0
Entering edit mode

Dear Shoujun, your script runs well, but the output is not what I want. Please take another look at this problem. Thank you for helping me.

Actually, your script is not removing the gaps properly. I want to remove only the unaligned (gap) regions from the > records. Here is a brief example with the desired output:

tem1------------------------------------------------------FHLTTR GGEPHMIVSKQERGKSLLFKTSAGVNMCTLIAMDLGELCEDTMTYKCPRITETEPDDVDC

tem2------------------------------------------------------------ --------------------------TPVECFEPSMLKKKQLTVLDLHPG-G-KTRRVLP

query MNNQRKKTGKPSINMLKRVRNRVSTGSQLAKRFSKGLLNGQGPMKLVMAFIAFLRFLAIP PTAGVLARWGTFKKSGAIKVLKGFKKEISNMLSIINQRKKTSLCLMMILPAALAFHLTSR

I want output like this

tem1 FHLTTRGGEPHMIVSKQERGKSLLFKTSAGVNMCTLIAMDLGELCEDTMTYKCPRITETEPDDVDC tem2-------------------------------------TPVECFEPSMLKKKQLTVLDLHPG-G-KTRRVLP query MNNQRPTAGVLARWGTFKKSGAIKVLKGFKKEISNMLSIINQRKKTSLCLMMILPAALAFHLTSR

Condition 1: As you can see, if gaps are found at the same positions in both tem1 and tem2, those positions should be removed, and the same positions should also be removed from the query. It does not matter whether the query has a gap there or not; the same number of characters must be removed at the same positions.

Condition 2: A gap run should be removed only if it is longer than 20 residues and is found at the same positions in all >tem sequences.
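
One way to read the two conditions above is sketched below. It assumes that each sequence has already been joined into a single string of equal length, that "all >tem" means the templates only (not the query), and that the threshold is "more than 20" as stated. The function names shared_gap_runs and drop_columns are made up for illustration.

```python
from itertools import groupby


def shared_gap_runs(templates, min_run=20):
    """Return half-open (start, end) column ranges where every template
    has '-' for a run longer than min_run columns."""
    # Assumes aligned sequences of equal length; min() only guards against
    # an IndexError if they are not.
    length = min(len(seq) for seq in templates)
    all_gap = [all(seq[i] == '-' for seq in templates) for i in range(length)]
    runs = []
    pos = 0
    for is_gap, group in groupby(all_gap):
        n = len(list(group))
        if is_gap and n > min_run:
            runs.append((pos, pos + n))
        pos += n
    return runs


def drop_columns(seq, runs):
    """Remove the half-open column ranges in runs from seq."""
    kept = []
    prev = 0
    for start, end in runs:
        kept.append(seq[prev:start])
        prev = end
    kept.append(seq[prev:])
    return ''.join(kept)


tem1 = '-' * 25 + 'ACDEFG'
tem2 = '-' * 28 + 'KLM'
query = 'M' * 25 + 'NPQRST'

runs = shared_gap_runs([tem1, tem2])
trimmed = [drop_columns(seq, runs) for seq in (tem1, tem2, query)]
```

Here only the first 25 columns are gaps in both templates, so those columns are dropped from all three sequences. The three extra gaps in tem2 stay because tem1 has residues there.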

ADD REPLY
1
Entering edit mode

The format of your example is garbled, so I cannot really understand how you want the gaps removed.

Could you upload it as a figure instead?

ADD REPLY
0
Entering edit mode

Yes I can

please give me email ID?

You can ping on my email id

redacted (see below)

ADD REPLY
1
Entering edit mode

You can upload to the forum. Or send to my email: redacted

ADD REPLY
0
Entering edit mode

Personal email addresses are not shared on Biostars.

ADD REPLY
0
Entering edit mode

Please use the "101" button in the edit window to apply code formatting to your text. Do not use spaces and the (") icon to format text. I tried to format your example above but can't make exact sense of what you have/need.

ADD REPLY
0
Entering edit mode

I want to remove all positions containing a gap in a multiple alignment, using Perl, Python, or any other scripting language.
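
Taken literally, "all positions containing a gap" means dropping every alignment column in which at least one sequence has a '-'. A minimal sketch of that reading, assuming the sequences are already joined into equal-length strings (drop_gapped_columns is a made-up name):

```python
def drop_gapped_columns(seqs):
    """Remove every column in which at least one sequence has '-'."""
    # zip(*seqs) yields one tuple of characters per alignment column.
    kept = [col for col in zip(*seqs) if '-' not in col]
    if not kept:
        return ['' for _ in seqs]
    # Transpose the surviving columns back into one string per sequence.
    return [''.join(chars) for chars in zip(*kept)]


aligned = ['AC-GT', 'A-TGT', 'ACTG-']
print(drop_gapped_columns(aligned))
```

This is much more aggressive than removing only long gap runs shared by the templates, so it may not be what the original question is after.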

ADD REPLY
0
Entering edit mode

If you want to remove all gaps from the file:

sed -i 's/-//g' input.fasta

Otherwise, if you are not removing all of them, you need to know the positions of the specific gaps to remove. The original post is vague about how you intend to identify those gaps.
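
Note that the substitution in the sed command applies to every line, so hyphens in the > header lines would be removed as well. A small Python sketch that strips gaps from sequence lines only (strip_gaps is a made-up name):

```python
def strip_gaps(lines):
    """Remove '-' from sequence lines, leaving '>' header lines untouched."""
    return [line if line.startswith('>') else line.replace('-', '')
            for line in lines]


print(strip_gaps(['>seq-1', '--AC-GT', '>seq-2', 'A--CGT']))
```

Either way, removing every gap means the sequences are no longer aligned to each other.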

ADD REPLY
0
Entering edit mode

This script works well, but I want to remove gaps position by position rather than line by line. Can you modify the program so that a gap run is removed only when it continues for more than 20-25 positions?

ADD REPLY
0
Entering edit mode

Could you please modify this script? It removes gaps line by line, but I would like to remove them position by position, and only when a gap run is longer than 20. This might be possible with arrays: store the gap positions of each file in an array and match them against the other files. If a gap is found at the same positions in all files, remove it; if it is found in only one of them, skip it.
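
A possible starting point for the array idea described here, assuming "files" means the records of one wrapped FASTA file: join each record's lines into a single string, then collect the gap positions per record and intersect them. The names read_fasta and gap_positions are made up for illustration.

```python
def read_fasta(lines):
    """Return a dict mapping each header (without '>') to its joined sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {name: ''.join(parts) for name, parts in records.items()}


def gap_positions(seq):
    """Return the set of 0-based indices at which seq has a gap character."""
    return {i for i, ch in enumerate(seq) if ch == '-'}


records = read_fasta(['>f1', '--AC', 'GT', '>f2', '-TAC', '-T'])
shared = gap_positions(records['f1']) & gap_positions(records['f2'])
```

Joining the wrapped lines first is what makes a position-by-position comparison possible; the set intersection then holds the positions that are gaps in both records.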

ADD REPLY
1
Entering edit mode

You should try and give it a shot and post your version here for feedback.

ADD REPLY
1
Entering edit mode

import sys

inp=sys.argv[1]
output=sys.argv[2]

count=0
tempcount=0
with open(inp, 'r') as file:
    for line in file:
        if line[0]=='>':
            if count<tempcount:
                count=tempcount
            tempcount=0
        elif set(line)=={'-','\n'}:
            tempcount=tempcount+1
if count<tempcount:
    count=tempcount

with open(inp, 'r') as file2:
    lines=file2.read().split('>')[1:]

list=[]
i=count+1
for fa in lines:
    fasta=fa.split('\n')
    list=list+['>'+fasta[0]]+fasta[i:]
list=[i+'\n' for i in list if i]

with open(output, 'w') as out:
    out.writelines(list)

This script removes gaps (-) line by line, but I would like to remove them position by position. If a run of more than 20 continuous gaps is found at the same positions in all input template sequences, remove it; if any template has no gap at those positions, skip it. Example input:

f1 ------------------------------------------------------------------------------------------------------GTTYGVCSKAFKFLGTPADTGHGTVVLELQYTGTDGPCKVPISSVASLNDLTPVGRLVTVNPFVSVA

f2 -----------------------------MVHRQWFFDLPLPWA-----------------------------------------GTTYGMCTEKFSFAKNPADTGHGTVVIELSYSGSDGPCKIPIVSVASLNDMTPVGRLVTVNPFVATS

Query IEGKITNITTDSRCPTAVLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQCLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGGG

The output should look like this:

f1 --------------------------------GTTYGVCSKAFKFLGTPADTGHGTVVLELQYTGTDGPCKVPISSVASLNDLTPVGRLVTVNPFVSVA

f2 MVHRQWFFDLPLPWAGTTYGMCTEKFSFAKNPADTGHGTVVIELSYSGSDGPCKIPIVSVASLNDMTPVGRLVTVNPFVATS

f3 AVLPEEQDQNYVCKHTCAKFQCLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGGG

ADD REPLY
