Phased Genotypes information
21 months ago
paolo002 • 140

Hi all. First of all, I apologise if the following has been asked in previous posts, but I have not been able to find a solution to my problem.

I have a data frame with a list of SNPs, their positions, and their reference (REF) and alternate (ALT) alleles. In addition, I have the phased genotypes for a number of individuals.


SNP        CHR  POS     REF  ALT  ID1  ID2  ID3
rs2754554  1    8656    A    C    0|1  0|0  1|1
rs1111786  16   975544  T    A    0|0  0|1  1|0
rs986355   7    75987   G    T    1|1  0|1  1|1
rs2256743  21   442324  G    C    1|0  0|1  0|1

In this example I have only 4 SNPs and 3 individuals, but the real list is much larger. I would like to replace each genotype code with the corresponding alleles, using the information in the REF and ALT columns:

Desired output:

SNP        CHR  POS     REF  ALT  ID1  ID2  ID3
rs2754554  1    8656    A    C    A|C  A|A  C|C
rs1111786  16   975544  T    A    T|T  T|A  A|T
rs986355   7    75987   G    T    T|T  G|T  T|T
rs2256743  21   442324  G    C    C|G  G|C  G|C

The output is based on my understanding that 0 means the reference allele and 1 means the alternate allele. Any help is highly appreciated.

R Linux • 270 views

Hi, thanks for your solutions to the problem. Sorry, the dots are there, but please ignore them; the rest is correct. Thanks!

14 months ago
France/Nantes/Institut du Thorax - INSE…

Using awk (I don't understand the meaning of those dots...):

awk  '(NR==1) {print; next;}{printf("%s %s %s %s %s",$1,$2,$3,$4,$5);for(i=6;i<=NF;i++) {split($i,a,/[\|\.]/);printf("\t%s|%s",(a[1]=="0"?$4:$5),(a[2]=="0"?$4:$5));} printf("\n");}' input.txt
SNP.       CHR.    POS.     REF  ALT  ID1  ID2 ID3 
rs2754554. 1. 8656. A. C    A.|C    A.|A.   C|C
rs1111786. 16. 975544 T A   T|T T|A A|T
rs986355. 7. 75987. G. T.   T.|T.   G.|T.   T.|T.
rs2256743. 21. 442324. G. C C|G.    G.|C    G.|C
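For readability, the one-liner above can be unpacked into a commented version. This is only a sketch under a few assumptions: the function names `decode_genotypes` and `allele` are made up for illustration, the input is whitespace-delimited with the stray dots already removed, genotypes start in column 6, and every genotype has exactly two alleles separated by `|`. Unlike the one-liner, it writes every column tab-separated and leaves any code other than 0 or 1 unchanged instead of treating it as ALT.

```shell
# Hypothetical helper: reads the table on stdin, writes the decoded table to stdout.
decode_genotypes() {
  awk '
  # Map one allele code to a base: 0 -> REF, 1 -> ALT, anything else is kept as-is.
  function allele(code, ref, alt) {
    return code == "0" ? ref : (code == "1" ? alt : code)
  }
  BEGIN { OFS = "\t" }
  # Header line: reassigning a field rebuilds the record with tab separators.
  NR == 1 { $1 = $1; print; next }
  {
    # Columns 6..NF hold the phased genotypes, e.g. 0|1.
    for (i = 6; i <= NF; i++) {
      split($i, a, /[|]/)
      $i = allele(a[1], $4, $5) "|" allele(a[2], $4, $5)
    }
    print
  }'
}
```

It could then be run as `decode_genotypes < input.txt > output.txt`, where `output.txt` is just an example file name.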
4 months ago

What are these dots?

$ awk -v OFS="\t" 'NR>1 { for(i=6;i<=NF;i++) {gsub("0",$4,$i); gsub("1",$5,$i) }}1' input.txt|sed 's|\.\.|\.|g'

EDIT: Of course Pierre was faster ;)

