Phased Genotypes information
5.3 years ago
paolo002 ▴ 160

Hi all, first of all I apologise if this has been asked in previous posts, but I have not been able to find a solution to my problem.

I have a data frame with a list of SNPs, their locations, and their Reference (REF) and Alternate (ALT) alleles. In addition, I have the phased genotypes for a number of individuals.

Example

SNP        CHR  POS     REF  ALT  ID1  ID2  ID3
rs2754554  1    8656    A    C    0|1  0|0  1|1
rs1111786  16   975544  T    A    0|0  0|1  1|0
rs986355   7    75987   G    T    1|1  0|1  1|1
rs2256743  21   442324  G    C    1|0  0|1  0|1

The example has only 4 SNPs and 3 individuals, but the real list is much larger. I would like to replace each genotype code with the corresponding allele from the REF and ALT columns:

Desired output:

SNP        CHR  POS     REF  ALT  ID1  ID2  ID3
rs2754554  1    8656    A    C    A|C  A|A  C|C
rs1111786  16   975544  T    A    T|T  T|A  A|T
rs986355   7    75987   G    T    T|T  G|T  T|T
rs2256743  21   442324  G    C    C|G  G|C  G|C

The output is based on my understanding that 0 means the reference allele and 1 means the alternate allele. Any help is highly appreciated.

R Linux • 943 views

Hi, thanks for your solutions to the problem. Sorry, the dots are there by mistake, so please ignore them; the rest is correct. Thanks!

5.3 years ago

Using awk (I don't understand what those dots mean...):

$ awk  '(NR==1) {print; next;}{printf("%s %s %s %s %s",$1,$2,$3,$4,$5);for(i=6;i<=NF;i++) {split($i,a,/[\|\.]/);printf("\t%s|%s",(a[1]=="0"?$4:$5),(a[2]=="0"?$4:$5));} printf("\n");}' input.txt
SNP.       CHR.    POS.     REF  ALT  ID1  ID2 ID3 
rs2754554. 1. 8656. A. C    A.|C    A.|A.   C|C
rs1111786. 16. 975544 T A   T|T T|A A|T
rs986355. 7. 75987. G. T.   T.|T.   G.|T.   T.|T.
rs2256743. 21. 442324. G. C C|G.    G.|C    G.|C
5.3 years ago

What are these dots?

$ awk -v OFS="\t" 'NR>1 { for(i=6;i<=NF;i++) {gsub("0",$4,$i); gsub("1",$5,$i) }}1' input.txt|sed 's|\.\.|\.|g'

EDIT: Of course Pierre was faster ;)
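
For readers who want to see the 0 = REF, 1 = ALT rule spelled out, here is a commented sketch of the same idea. It is not either poster's exact command. It assumes the input is tab-separated, that the stray dots have been removed (as the original poster clarified), and that every site is biallelic. The file names `input.txt` and `output.txt` and the choice to leave any code other than 0 or 1 (for example a missing `.`) unchanged are assumptions made for illustration.

```shell
# Recreate the sample table from the question as a tab-separated file.
printf 'SNP\tCHR\tPOS\tREF\tALT\tID1\tID2\tID3\n'        >  input.txt
printf 'rs2754554\t1\t8656\tA\tC\t0|1\t0|0\t1|1\n'       >> input.txt
printf 'rs1111786\t16\t975544\tT\tA\t0|0\t0|1\t1|0\n'    >> input.txt
printf 'rs986355\t7\t75987\tG\tT\t1|1\t0|1\t1|1\n'       >> input.txt
printf 'rs2256743\t21\t442324\tG\tC\t1|0\t0|1\t0|1\n'    >> input.txt

awk 'BEGIN { FS = OFS = "\t" }
NR == 1 { print; next }                 # header line passes through unchanged
{
    allele["0"] = $4                    # code 0 -> REF allele of this row
    allele["1"] = $5                    # code 1 -> ALT allele of this row
    for (i = 6; i <= NF; i++) {         # genotype columns start at field 6
        n = split($i, hap, "[|]")       # split the phased genotype on "|"
        out = ""
        for (j = 1; j <= n; j++) {
            code = hap[j]
            # translate known codes; leave anything else (e.g. ".") as it is
            out = out (j > 1 ? "|" : "") ((code in allele) ? allele[code] : code)
        }
        $i = out
    }
    print
}' input.txt > output.txt

cat output.txt
```

Because the haplotypes are translated one at a time and rejoined with `|`, the phase order within each genotype is preserved.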
