How to sequentially replace a string with numerical value?
7
1
Entering edit mode
7.3 years ago

Hi everyone,

I have a list of snps with string chromosome names that I want to replace with a sequential numerical value (preparing a qqman input file). For example, turn this:

chrA
chrA
chrA
chrB
chrB
chrC
chrC
chrC
chrC
...
chrZ
chrZ

into this

1
1
1
2
2
3
3
3
3
..
26
26

At the moment I am doing this with a script, using something similar to this:

sed 's/chrA/1/g'
sed 's/chrB/2/g'
sed 's/chrC/3/g'
..
sed 's/chrZ/26/g'

However, my actual genome has a large number of contigs, so this command takes a lot of time to run. Is there a more efficient way to do this?

qqman genome • 2.8k views
ADD COMMENT
5
Entering edit mode
7.3 years ago

Here's a solution with replace subcommand (usage) of csvtk, just download the .tar.gz file, decompress and you can run :)

First you have to prepare a mapping file, which is a plain tab-separated text file, you can easily use a spreadsheet software to create and export.

$ more mapping.tsv
chrA    1
chrB    2
chrC    3

I guess the SNP data file should be a tab-delimited file too. Here's a dummy one:

$ more data.tsv 
chrA    A       z 
chrA    A       x 
chrA    A       c 
chrB    B       v 
chrB    B       d 
chrC    C       tx
chrC    C       t
chrC    C       x
chrC    C       z

Then use csvtk to edit the SNP data file:

$ ./csvtk -H -t replace -f 1 -p '(.+)' -r '{kv}' -k mapping.tsv data.tsv 
[INFO] read key-value file: mapping.tsv
[INFO] 3 pairs of key-value loaded
1       A       z
1       A       x
1       A       c
2       B       v
2       B       d
3       C       tx
3       C       t
3       C       x
3       C       z

The long-option version would be easier to understand:

./csvtk --no-header-row --tabs replace --fields 1 --pattern '(.+)' --replacement '{kv}' --kv-file mapping.tsv data.tsv

PS: this is a general method not limited to this case. sed is good for single replacement, csvtk replace -k can handle multiple replacements well, which is written in Go with good performance.

PS2: seqkit has exactly the same function to handle FASTA/Q files.

ADD COMMENT
0
Entering edit mode

This solution works perfectly, and it is extremely quick, thank you very much!

ADD REPLY
4
Entering edit mode
7.3 years ago
Charles Plessy ★ 2.9k

Perhaps you can do it in R by converting to factor coercing it to numerical values ?

> l <- c("chrA", "chrA", "chrA", "chrB", "chrB", "chrC", "chrC", "chrC", "chrC")
> as.numeric(factor(l))
[1] 1 1 1 2 2 3 3 3 3
ADD COMMENT
0
Entering edit mode

now that I've taken a look at what is qqman, I like this approach better than any other, including mine, since qqman is an R pipeline, and this is an R solution. I'd go with something like:

gwasResults$CHR=as.numeric(factor(gwasResults$CHR))
ADD REPLY
1
Entering edit mode
7.3 years ago
bruce.moran ▴ 960

Hi,

I would use command line Perl for this, specifically for the 'ord' function:

perl -ane '$F[0]=~s/chr//;$o=ord($F[0])-64;print "$o\n";' <input.file>

You still have to sed the "chr", then find the ord of that, then subtract 64 as the ord returns the ASCII numeric value.

Hope that helps.

ADD COMMENT
1
Entering edit mode
7.3 years ago
biocyberman ▴ 860

Bash way if you prefer. Here are some cool bash tricks for you.

chr=( chr{A..Z} ) # make an array for chromosomes.
# now loop through the array to build sed command
sedcmd=""
for i in  {1..26}; do
 sedcmd=$sedcmd"s/${chr[$i]}/$i/;" # build up 26 sed replacement command :)
done
# run sed
sed -e "$sedcmd" inputfile > outputfile

Because sed works line by line and you may want to replace only the first match in the line so no need for the g option.

ADD COMMENT
1
Entering edit mode
7.3 years ago
Malcolm.Cook ★ 1.5k

I'd say the problem is underspecified....

  • it looks like the chromosomes are all in order in the example data, but we are not told that this is guaranteed. Can we really depend on this? For instance, might "chrA" pop up again after seeing "chrB"?
  • it looks like they all begin with "chr" followed by a single letter, but is that really true of the actual data? The OP writes that there "a large number of contigs". I'm guess this is more than the 26 in the example data.
  • it looks like the list of chromosomes is "known" up front. It it really?
  • it looks like the input data has only a single column that needs to be mapped into integers. Really?!?

Here is a perl one-liner that only makes the last assumption... it builds the name-to-integer mapping "on the fly" in a data driven way.

perl -lpe '$_ = $x{$_} //= keys(%x)'  chom_names.txt > chrom_nums.txt

It uses a few "tricky" aspects of Perl, but they are worth studying and knowing.

The following modification avoids making that last assumption. It uses a few more "tricky" aspects. You can provide which column contains the chromosome in switch -col, and provide an output delimiter (tab, in this case) in switch -delim

 perl -slape 'BEGIN{$" = $delim} $F[$col] = $x{$F[$col]} ||= keys(%x); $_ = "@F"'  -- -col=2 -delim=$'\t' chom_names.txt > chrom_nums.txt

Good luck!

ADD COMMENT
0
Entering edit mode

Nice ! Maybe perl -nE 'say $x{$_} //= keys(%x)' would be slightly more modern than perl -lpe '$_=$x{$_}||=keys(%x)'.

ADD REPLY
0
Entering edit mode

Erhm, you caught me! I've not yet made the jump to Perl 6. Do you advise it? Most of what I used to do in Perl I now do in either bash, R, or python anyway... hmmm...

ADD REPLY
0
Entering edit mode

Actually, the "modern" version above is plain perl5 in which all the Perl features have been enabled by using -E instead of -e. Here, it is used for the say function. In Perl 6, the one-liner would look like perl6 -ne 'state %x ; say %x{$_} //= %x.elems + 1'. I am still very much a beginner in Perl 6, but it looks interesting.

ADD REPLY
0
Entering edit mode

Thanks for the education. And, I'm editing my old-school answer to adopt your use of //= instead of ||=, which arguably is preferable.

ADD REPLY
0
Entering edit mode

Despite of Perl6, th best improvement for Perl is Python :) Perl6 is a late "make up" of an old time champion. Just personal opinion.

ADD REPLY
0
Entering edit mode
7.3 years ago

Thank you for all the quick responses. The solution by shenwei356 already worked, but I'll try the other ones and give feedback

ADD COMMENT
0
Entering edit mode
7.3 years ago

sed can read a file of multiple patterns:

sed -f pattern in.txt > out.txt

https://linux.die.net/man/1/sed


   -f script-file, --file=script-file
     add the contents of script-file to the commands to be executed
ADD COMMENT

Login before adding your answer.

Traffic: 2899 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6