Phased Genotypes information
21 months ago
paolo002 • 140

Hi all, first of all I apologise if this has been asked in previous posts, but I have not been able to find a solution to my problem.

I have a data frame with a list of SNPs, their positions, and their reference (REF) and alternate (ALT) alleles. In addition, I have phased genotypes for a number of individuals.

Example

SNP        CHR  POS     REF  ALT  ID1  ID2  ID3
rs2754554  1    8656    A    C    0|1  0|0  1|1
rs1111786  16   975544  T    A    0|0  0|1  1|0
rs986355   7    75987   G    T    1|1  0|1  1|1
rs2256743  21   442324  G    C    1|0  0|1  0|1

In this example I have only 4 SNPs and 3 individuals, but the real list is much larger. I would like to replace each genotype code with the corresponding alleles from the REF and ALT columns:

Desired output:

SNP        CHR  POS     REF  ALT  ID1  ID2  ID3
rs2754554  1    8656    A    C    A|C  A|A  C|C
rs1111786  16   975544  T    A    T|T  T|A  A|T
rs986355   7    75987   G    T    T|T  G|T  T|T
rs2256743  21   442324  G    C    C|G  G|C  G|C

The output is based on my understanding that 0 means the reference allele and 1 means the alternate allele. Any help is highly appreciated.

R Linux • 270 views

Hi, thanks for your solutions to the problem. Sorry, the dots are there by mistake, so please ignore them; the rest is correct. Thanks!

14 months ago
France/Nantes/Institut du Thorax - INSE…

Using awk (I don't understand the meaning of those dots...):

awk  '(NR==1) {print; next;}{printf("%s %s %s %s %s",$1,$2,$3,$4,$5);for(i=6;i<=NF;i++) {split($i,a,/[\|\.]/);printf("\t%s|%s",(a[1]=="0"?$4:$5),(a[2]=="0"?$4:$5));} printf("\n");}' input.txt
SNP.       CHR.    POS.     REF  ALT  ID1  ID2 ID3 
rs2754554. 1. 8656. A. C    A.|C    A.|A.   C|C
rs1111786. 16. 975544 T A   T|T T|A A|T
rs986355. 7. 75987. G. T.   T.|T.   G.|T.   T.|T.
rs2256743. 21. 442324. G. C C|G.    G.|C    G.|C
4 months ago
Germany

What are these dots?

$ awk -v OFS="\t" 'NR>1 { for(i=6;i<=NF;i++) {gsub("0",$4,$i); gsub("1",$5,$i) }}1' input.txt|sed 's|\.\.|\.|g'

EDIT: Of course Pierre was faster ;)
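
For readers who find the one-liners above hard to follow, here is a longer sketch of the same idea, assuming the dot-free input from the question (whitespace-separated, one header line, REF in column 4, ALT in column 5, genotypes from column 6 onwards). The function name decode_genotypes is made up for illustration. It splits each genotype on "|", maps 0 to REF and 1 to ALT, and leaves any other code as it is rather than guessing.

```shell
# Hypothetical helper (the name is an assumption, not from the thread):
# translate phased 0|1 genotype codes into the REF/ALT alleles.
# Assumes columns SNP CHR POS REF ALT followed by one column per individual.
decode_genotypes() {
  awk -v OFS='\t' '
    NR == 1 { $1 = $1; print; next }            # header: re-join fields with tabs
    {
      for (i = 6; i <= NF; i++) {
        n = split($i, code, "[|]")              # "0|1" -> code[1]="0", code[2]="1"
        out = ""
        for (j = 1; j <= n; j++) {
          if (code[j] == "0")      allele = $4  # 0 means REF
          else if (code[j] == "1") allele = $5  # 1 means ALT
          else                     allele = code[j]  # unknown code: keep as is
          out = out (j > 1 ? "|" : "") allele
        }
        $i = out                                # rebuilds the row using OFS
      }
      print
    }
  ' "$@"
}
```

Call it as `decode_genotypes input.txt`, or pipe the table into it. The output is tab-separated. Unlike the gsub approach, only a whole allele code equal to 0 or 1 is replaced, so a code such as 2 at a multi-allelic site would pass through unchanged instead of being rewritten.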
