How to substitute the Nth character in a string using Linux commands
8.0 years ago
mbk0asis ▴ 680

Hi.

I have a text file containing multiple FASTA sequences, and I'd like to replace the Nth character in each sequence.

If I want to replace the 10th character of each sequence in the example below, which Linux command can I use?

> sample1
TATCCGATGCGACGTGCAGCG
> sample2    
CTAGCGTAGTGTCGACTGCAT
> sample3
GACTGACGTGACGTAGTCGAC

Thank you!

linux commands • 7.5k views
8.0 years ago

Assuming one-line sequences in your FASTA-formatted input, replace X with your character of interest:

$ awk '{
    if ($0 ~ /^>/) {
      print $0;    # header line: print unchanged
    }
    else {
      # sequence line: first 9 characters, the replacement, then character 11 onward
      printf("%s%c%s\n", substr($0, 1, 9), "X", substr($0, 11, length($0) - 10));
    }
  }' in.fa > out.fa
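
If the position and the replacement character need to vary, the same idea can be wrapped in a small shell function. This is only a sketch: the name replace_nth and its argument order are made up for illustration, and it still assumes one-line sequences. Unlike the one-liner above, it leaves lines shorter than N characters untouched instead of appending the new character to them.

```shell
# replace_nth N CHAR FILE
# Hypothetical helper (not part of the original answer): replaces the Nth
# character of every non-header line with CHAR.
replace_nth() {
  awk -v n="$1" -v c="$2" '
    /^>/ { print; next }          # header lines pass through unchanged
    length($0) >= n {             # only rewrite lines that have an Nth character
      print substr($0, 1, n - 1) c substr($0, n + 1)
      next
    }
    { print }                     # shorter lines are printed as they are
  ' "$3"
}

# Example call (file names are placeholders):
#   replace_nth 10 X in.fa > out.fa
```

Calling substr() with only two arguments returns everything from the start index to the end of the line, so the length arithmetic is not needed here.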

If you have multi-line sequences in your FASTA input, you need to render them into single-line sequences before using this one-liner. See the following Biostars question (and answer) for a solution: Multiline Fasta To Single Line Fasta
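
For reference, one common way to join multi-line sequences with awk is sketched below. It is not necessarily the solution given in the linked question, and the function name linearize_fasta is invented here for illustration.

```shell
# linearize_fasta FILE
# Hypothetical helper: prints each header followed by its sequence joined
# onto a single line.
linearize_fasta() {
  awk '
    /^>/ {
      if (seq != "") print seq    # flush the previous record before a new header
      print                       # print the header itself
      seq = ""
      next
    }
    { seq = seq $0 }              # append each sequence line to the buffer
    END { if (seq != "") print seq }   # flush the last record
  ' "$1"
}

# Example call (file names are placeholders):
#   linearize_fasta multi.fa > in.fa
```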


Thank you, Alex! It works well. I understand the part between 'if' and 'else', but I can't really figure out what the command after 'else' means. Would you explain it to me? Thank you.


The printf() command prints a string made up of three format specifiers %s, %c and %s along with a newline character \n.

The first %s corresponds to the string value of substr($0, 1, 9).

The %c corresponds to the character value of X.

The second %s corresponds to the string value of substr($0, 11, length($0) - 10).

The substr() function returns the substring of the string passed in, generated from the starting index and length passed in.

So substr($0, 1, 9) takes the substring of the sequence line $0 from the start index of 1 — the first character — and grabs the first nine characters.

The second substr($0, 11, length($0) - 10) takes the substring of the sequence line starting at index 11, the character just after the one being replaced, and grabs length($0) - 10 characters, which is everything from there to the end of the line.

See the awk documentation here for more detail on string functions.
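
To see the three pieces separately, here is a small sketch that runs the same substr() calls on sample1 from the question and prints the parts with | between them. The function name show_parts is made up for this example.

```shell
# show_parts SEQUENCE
# Hypothetical helper: prints the head, the replacement character, and the
# tail of a sequence, separated by "|".
show_parts() {
  printf '%s\n' "$1" | awk '{
    head = substr($0, 1, 9)                  # characters 1 through 9
    tail = substr($0, 11, length($0) - 10)   # character 11 through the end
    printf("%s|%c|%s\n", head, "X", tail)
  }'
}

show_parts "TATCCGATGCGACGTGCAGCG"
# prints: TATCCGATG|X|GACGTGCAGCG
```

The original 10th character (C) is the only one that appears in neither piece, which is why it gets replaced.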


I really appreciate your detailed explanation. You rock!!!

8.0 years ago
Daniel ★ 4.0k

In the "There's more than one way to do it" camp, and waving the "one liners are fun" flag, this should do the job:

sed -E '/^>/! s/^(.{9}).(.*)/\1X\2/' yourfile.fasta > newfile.fasta

Replace X in the second part with what your new insert is. The -E flag turns on extended regular expressions, so ( ) and {9} work without backslashes.

Explanation time:

  • /^>/! | Only apply the substitution to lines that don't start with a >
  • ^(.{9}) | The first 9 characters of the line, captured as \1
  • . | One character that you're throwing away
  • (.*) | The rest of the line, captured as \2
  • /\1X\2/ | Recall the first part, then the new insert, then the second part

See a visual representation here
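
A quick way to check a sed approach like this on the sample data is sketched below. It assumes a sed that supports -E for extended regular expressions (GNU and BSD sed both do), and the function name replace_10th_sed is invented for the example. The header test is done with a /^>/! address, so the whole 9-character prefix can be captured in one group.

```shell
# replace_10th_sed FILE
# Hypothetical helper: replaces the 10th character of every non-header line
# with X.
replace_10th_sed() {
  # /^>/!      run the substitution only on lines that do not start with ">"
  # ^(.{9}).   capture the first 9 characters, then match the 10th
  # \1X        put the 9 characters back, followed by the replacement
  sed -E '/^>/! s/^(.{9})./\1X/' "$1"
}

# Example call (file names are placeholders):
#   replace_10th_sed yourfile.fasta > newfile.fasta
```

Lines shorter than 10 characters simply do not match and are printed unchanged.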
