Question

How to substitute the Nth character in a string using Linux commands

1

8.0 years ago

mbk0asis ▴ 680

Hi.

I have a text file containing multiple FASTA sequences, and I'd like to replace the Nth character in each sequence.

For example, if I want to replace the 10th character of each sequence below, which Linux command can I use?

> sample1
TATCCGATGCGACGTGCAGCG
> sample2    
CTAGCGTAGTGTCGACTGCAT
> sample3
GACTGACGTGACGTAGTCGAC

Thank you!

linux commands • 7.5k views

ADD COMMENT • link updated 8.0 years ago by Daniel ★ 4.0k • written 8.0 years ago by mbk0asis ▴ 680

0

8.0 years ago

Daniel ★ 4.0k

In the "There's more than one way to do it" camp, and waving the "one liners are fun" flag, this should do the job:

sed -E 's/^([^>].{8}).(.*)/\1X\2/' yourfile.fasta > newfile.fasta

Replace X in the second part with what your new insert is.

Explanation time:

-E | Use extended regex syntax, so ( ) and {8} need no backslashes
^[^>] | Line doesn't start with a >; this is the 1st character
.{8} | 8 more "any character", so the group ([^>].{8}) holds characters 1 to 9
. | The 10th character, which you're throwing away
(.*) | The rest of the line
/\1X\2/ | Recall the first part, then the new insert, then the second part

See a visual representation here
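
For a position other than 10, the same idea can be wrapped in a small shell function. This is only a sketch: replace_nth is a made-up name, not something from the answer above, and it assumes a sed that supports -E (GNU and BSD sed both do). It skips header lines with an address instead of matching [^>] inside the pattern.

```shell
# Hypothetical helper (not from the original answer): replace the Nth
# character of every sequence line, leaving ">" header lines untouched.
# Usage: replace_nth N CHAR [FILE...]   (reads stdin when no FILE is given)
replace_nth() {
    n="$1"
    new="$2"
    shift 2
    # /^[^>]/ selects non-empty lines that do not start with ">".
    # (.{N-1}) captures the first N-1 characters; the next "." is replaced.
    sed -E "/^[^>]/s/^(.{$((n - 1))})./\\1${new}/" "$@"
}

printf '%s\n' '> sample1' 'TATCCGATGCGACGTGCAGCG' | replace_nth 10 X
# > sample1
# TATCCGATGXGACGTGCAGCG
```

Lines shorter than N characters pass through unchanged. The replacement character is pasted straight into the sed expression, so a /, & or \ would need escaping first.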

ADD COMMENT • link 8.0 years ago by Daniel ★ 4.0k

score 7 · Accepted Answer · 2016-04-13

7

8.0 years ago

Alex Reynolds 35k

Assuming one-line sequences in your FASTA-formatted input, replace X with your character of interest:

$ awk '{ \
    if ($0 ~ /^>/) { \
      print $0; \
    } \
    else { \
      printf("%s%c%s\n", substr($0, 1, 9), "X", substr($0, 11, length($0) - 10)); \
    } \
  }' in.fa > out.fa

If you have multi-line sequences in your FASTA input, you need to render them into single-line sequences before using this one-liner. See the following Biostars question (and answer) for a solution: Multiline Fasta To Single Line Fasta
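
As a rough sketch of that preprocessing step (not necessarily the method given in the linked answer), wrapped sequence lines can be joined with awk. The function name linearize_fasta is hypothetical, and the sketch assumes plain Unix line endings.

```shell
# Hypothetical helper: join the wrapped lines of each FASTA record into one
# sequence line. Reads stdin, or the files given as arguments.
linearize_fasta() {
    awk '
        /^>/ {                        # header line
            if (seq != "") print seq  # flush the previous record
            print
            seq = ""
            next
        }
        { seq = seq $0 }              # append a wrapped sequence line
        END { if (seq != "") print seq }
    ' "$@"
}

printf '%s\n' '>sample1' 'TATCCGATGCG' 'ACGTGCAGCG' | linearize_fasta
# >sample1
# TATCCGATGCGACGTGCAGCG
```

The output of this step can then be piped into the replacement command.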

ADD COMMENT • link 8.0 years ago by Alex Reynolds 35k

0

Thank you, Alex! It works well. I understand the part between 'if' and 'else', but I can't figure out what the command after 'else' means. Would you explain it to me? Thank you.

ADD REPLY • link 8.0 years ago by mbk0asis ▴ 680

0

The printf() command prints a string made up of three format specifiers %s, %c and %s along with a newline character \n.

The first %s corresponds to the string value of substr($0, 1, 9).

The %c corresponds to the character value of X.

The second %s corresponds to the string value of substr($0, 11, length($0) - 10).

The substr() function returns the substring of the string passed in, generated from the starting index and length passed in.

So substr($0, 1, 9) takes the substring of the sequence line $0 from the start index of 1 — the first character — and grabs the first nine characters.

The second substr($0, 11, length($0) - 10) takes the substring of the sequence line starting at the 11th character, with a length equal to the sequence length minus ten. That is everything after the 10th character, which is the one being replaced.

See the awk documentation here for more detail on string functions.
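
To see the pieces concretely, here is a small check on the first sequence from the question. The middle substr($0, 10, 1) call is not part of the original command; it is added here only to show the character that gets dropped.

```shell
# Collect the pieces that the printf() call works with, using the
# first sequence from the question.
pieces=$(echo 'TATCCGATGCGACGTGCAGCG' | awk '{
    print substr($0, 1, 9)                  # characters 1-9
    print substr($0, 10, 1)                 # character 10, the one being replaced
    print substr($0, 11, length($0) - 10)   # character 11 to the end
}')
printf '%s\n' "$pieces"
# TATCCGATG
# C
# GACGTGCAGCG
```

Joining the first and third pieces with X in between gives TATCCGATGXGACGTGCAGCG.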

ADD REPLY • link 8.0 years ago by Alex Reynolds 35k

0

I really appreciate your detailed explanation. You rock!!!

ADD REPLY • link 8.0 years ago by mbk0asis ▴ 680