Trimming gapped ends from a multiple alignment
3
0
Entering edit mode
8.3 years ago

Hi! I have a following problem. I have a multiple alignment that looks something like that:

SpeciesA --GTACCTAGGTACCT
SpeciesB AGGT---TAGGTACCT
SpeciesC AGGTACCTAGGT----

What I want to get is the following:

SpeciesA GTACCTAGGT
SpeciesB GT---TAGGT
SpeciesC GTACCTAGGT

i.e. gaps at the ENDS of the alignment trimmed off, the internal gaps preserved and all the sequences of the same length.

I've been looking everywhere to find a tool to help me but no luck so far. Most of the programmes like Gblocks, trimAl, t-coffee and similar can remove gapped columns but throughout the entire alignment. Due to that I lose the internal indels and the resulting alignment is like that:

SpeciesA GTTAGGT
SpeciesB GTTAGGT
SpeciesC GTTAGGT

Which I'm not interested in.

I know I can do it manually but I have about 5,000 alignment files... If I spent a minute on editing each of them I'd probably need 3,5 days of manual editing non-stop... That's way I'd appreciate any help VERY VERY much!

Cheers

alignment • 4.6k views
ADD COMMENT
0
Entering edit mode

You can edit the alignment and delete columns manually in Jalview and ClustalX.

ADD REPLY
0
Entering edit mode

Pretty sure Stack Overflow could give you a great awk one-liner in less than one minute. If they do, please post it back here as well :)

ADD REPLY
1
Entering edit mode
8.3 years ago
5heikki 11k

I wrote a very ugly Bash script that should do what you want (not tested at all, assumes there are no spaces in names and names are separated from the alignments by a single space). Name it e.g. trimEnds.sh, then chmod +x trimEnds.sh and use it by ./trimEnds.sh alignmentFile

#!/bin/bash
#separate the alignment from the names
cut -f2 -d " " $1 > $1.ali
cut -f1 -d " " $1 > $1.names

#Set start at 1
START=1

#Set end at alignment length
END=$(head -n1 $1.ali | awk '{print length}')

#Find out where the good stuff starts
for (( i=$START; i<=$END; i++ ))
    do
    nonLetterCount=$(cut -c$i $1.ali | tr -d [A,T,G,C] | grep . | grep -c .)
    if [ "$nonLetterCount" -gt "0" ]
        then
        (( START = $i + 1 ))
    else
        break
    fi
    done

#Find out where the good stuff ends
for (( i=$END; i>=$START; i-- ))
        do
    nonLetterCount=$(cut -c$i $1.ali | tr -d [A,T,G,C] | grep . | grep -c .)
        if [ "$nonLetterCount" -gt "0" ]
                then
                (( END = $i - 1 ))
        else
                break
        fi
    done

#Reunite names with seqs
paste -d " " $1.names <(cut -c$START-$END $1.ali) > $1.mod

#Remove unnecessary files
rm $1.names $1.ali
ADD COMMENT
0
Entering edit mode

THANKS A LOT! The script works for me and does exactly what I wanted. I run it on a few different cases of alignments and worked like a dream on each and every one of them. I just had to play a little bit with the formats but wasn't difficult. You save me a lot of time and nerves!

Thanks again for your time and effort!

ADD REPLY
0
Entering edit mode

Hi

Now I have the same problem. I tried using the shell script but I am getting the error. "cut: invalid decreasing range" What would be the reason?

ADD REPLY
0
Entering edit mode

My guess is that your data is not in the same format as OP's.

ADD REPLY
0
Entering edit mode

Oh.. Maybe.. But I edited to be one line fasta like in example. Still the same result.. Thank you

ADD REPLY
0
Entering edit mode

Hi I am trying this script but I got this error cut: invalid byte, character or field list. any suggestion what is the problem?

ADD REPLY
0
Entering edit mode

I am trying it but I have this error cut: invalid byte, character or field list do you have any suggestion

ADD REPLY
0
Entering edit mode
8.3 years ago
Pappu ★ 2.1k

You can edit the alignment and delete columns manually in Jalview and ClustalX.

ADD COMMENT
0
Entering edit mode

Only if I didn't have more than 5,000 files to edit... I'm looking for a way to avoid doing this manually.

ADD REPLY
0
Entering edit mode
7.2 years ago
Suzanne ▴ 100

Here is link to a Jalview video about editing sequences: https://youtu.be/ZI4NW6kQeHc. It is part of 'Selecting and Editing Sequences' in Jalview YouTube playlist https://www.youtube.com/playlist?list=PLpU3VZmUmrT27N2Uy32qszCOoznvN-07N.

ADD COMMENT

Login before adding your answer.

Traffic: 3152 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6