Removing illegal Characters from an Alignment file for RaxML
1
1
Entering edit mode
6.1 years ago
apulunuj ▴ 30

Hi everyone,

I am trying to run raxml on a directory of alignments. Unfortunately, my alignment sequence ID's contain illegal characters( the characters " ; " , " : " and " , "). I figured that I could make a regular expression to match these characters and replace them with an underscore.

I don't have any experience with regular expressions and would like help on this?

Thanks in advance!

unix python • 2.6k views
ADD COMMENT
1
Entering edit mode
6.1 years ago

This demonstrates a regex pattern that may help with replacements:

$ echo "foo;bar:baz,bing\$beep*bop;bloop" | sed -e 's/[;|:|,]/_/g'
foo_bar_baz_bing$beep*bop_bloop

That [;|:|,] is a pattern specifying matches on those characters. The underscore character is what sed replaces them with when there is a match, or "hit". The g specifier does a global replacement, where there are multiple hits. Without it, sed will replace the first hit and leave the rest untouched, e.g., see the following result and compare it with the first result:

$ echo "foo;bar:baz,bing\$beep*bop;bloop" | sed -e 's/[;|:|,]/_/'
foo_bar:baz,bing$beep*bop;bloop

If you are using Python, take a look at the re library:

>>> import re
>>> str = "foo;bar:baz,bing\$beep*bop;bloop"
>>> re.sub('[;|:|,]', '_', str)
'foo_bar_baz_bing$beep*bop_bloop'
ADD COMMENT
0
Entering edit mode

Thanks Alex much appreciated

ADD REPLY
0
Entering edit mode

Could use this to look at an entire directory as well? @Alex or am I not understanding how to use the sed command?

ADD REPLY
0
Entering edit mode

Perhaps combine a for loop with walk, to loop over all files in a directory: https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory

Inside the loop, you could perhaps run a function to modify the alignments:

  1. If alignments are in BAM format, convert to SAM (use subprocess.check_call in Python).
  2. Open the SAM file to read line by line (with open("reads.sam", "r") as in, open("readsModified.sam", "w") as out: etc.).
  3. Per line, break up the line into elements, rename sequence IDs in the specific field containing the ID (QNAME?) using the regex substitution. You probably don't want to run the regex on the entire line as it will likely corrupt the quality string and possibly the attributes, as well as the record.
  4. Write the modified SAM line out to the readsModified.sam file.
  5. Convert the modified SAM file back to BAM, using different filenames at each write and conversion step, so that you don't clobber the original files.
  6. Cleanup: remove the SAM files, as they aren't needed any longer.

The biostars search engine can help with how to convert BAM->SAM and back again. Stack Overflow and the answer above for the rest.

ADD REPLY

Login before adding your answer.

Traffic: 1061 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6