012 genotype matrix using vcf tools (converting rows to columns)
1
0
Entering edit mode
6.6 years ago
Ana ▴ 200

Hello everyone,

I have created a 012 genotype matrix by using --012 command in vcf tools which give individuals in row and SNPs in columns.

I want to run smartpca and and need to convert rows to columns. Means get SNPs in rows and individuals in columns.

I have tried some shel commands like this but seems it does not work.

awk ' { 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    } } NF>p { p = NF } END {    
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    } }'

I wonder if anyone has any idea how to do that?

Is there any way to do this manipulation in R? but R cannot read my output genotype matrix!

Thanks

genotype-matrix vcftools • 5.0k views
ADD COMMENT
0
Entering edit mode

Please provide a few lines of your matrix (with headers). Try datamash (GNU tool available in most of the linux repos) to transpose data easy. input:

  $ cat test.txt 
    Sample  id-123  id-99   id-42   id-13
    Count   1002    990 2030    599

output:

$ datamash transpose < test.txt  
Sample  Count
id-123  1002
id-99   990
id-42   2030
id-13   599
ADD REPLY
3
Entering edit mode
6.6 years ago

You just literally want to transpose the matrix?

It works for me in R Programming Language, just use the t() function:

df <- t(read.table("out.012", header=FALSE))

This will give you the 012 matrix without rownames or colnames. The use of the [-1,] sub-setting just gets rid of the uninformative header that VCFtools outputs with -012

To add the colnames, just use:

colnames(df) <- read.table("out.012.indv")[,1]

I will let you do the rownames for yourself.

ADD COMMENT
0
Entering edit mode

Hi @Kevin Blighe, thank you for your response. Well are you sure that you could transpose 012 genotype matrix created by vcf-tools in R? This is my out pot from vcf file (012-out)

> 0 -1  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   -1  0   0   0   0   0   1   0   0   0   0   0   0   -1  0   -1  -1  -1  0   0   -1  -1  -1

and R cannot read this file.I use this just to see

> df <- read.table(file = "012-out.txt", header=FALSE)
> head(df)

and the output doesnot make sense

> V636126 V636127 V636128 V636129 V636130 V636131 V636132 V636133
> V636134 V636135 V636136 V636137 V636138 V636139 V636140 V636141
> V636142 V636143 V636144 V636145 V636146 V636147 V636148 V636149
> V636150 V636151 V636152 V636153 V636154 V636155 V636156 V636157
> V636158 V636159 V636160 V636161 V636162 V636163 V636164 V636165
> V636166 V636167 V636168 V636169

I do not know what is wrong here and how can I solve it! I would sincerely appreciate if you could share your experience with me to get this done.

ADD REPLY
0
Entering edit mode

Yes, it genuinely works for me. I use VCFtools (0.1.15) and Microsoft R Open 3.2.5. You haven't done any conversions on the file after it is produced by VCFtools?

Your output above with the 'V' prefix is just the colnames that R automatically generates. I assume that there are 636,169 variants in your VCF?

Also, just to confirm, how many individuals are in your VCF? - 1 or more? The number of rows in the output file from VCFtools should match the number of samples in the VCF.

If you have >1 individuals, the line to read-in and transpose is:

df <- t(read.table("012-out.txt", header=FALSE))

If you just have 1 individual, then this works:

df <- t(read.table("012-out.txt", header=FALSE))
ADD REPLY
0
Entering edit mode

Hi @Kevin Blighe, Thanks. The the example above is just with subset of the data. yes! you are right, did you also get output with "V" prefix? By the way I found the solution, It works ... thanks so much

ADD REPLY
1
Entering edit mode

No, if I do the command exactly as I have it above (df <- t(read.table("out.012", header=FALSE))), and then run head(df), I get:

   [,1] [,2] [,3] [,4]
V1    0    1    2    3
V2    2    0    1    1
V3    2    2    2    2
V4   -1    2   -1   -1
V5    2   -1   -1   -1
V6   -1   -1   -1    2

This was a VCF file of 4 individuals and ~300,000 variants.

With a single individual, I get:

   [,1]
V1    0
V2    2
V3    2
V4   -1
V5    2
V6   -1

I also tried it on a VCF with >600,000 variants and it still functioned fine.

Which was the solution that worked?

ADD REPLY
0
Entering edit mode

So I created this little Rscript and got my desired output, thanks Kevin for sharing your experience with me:

>     #!/usr/bin/Rscript
>     
>     snps<-read.delim('wild_annuus.snps.fb.lenient.noTE_filt_DP15_GOOD_IND_ld.pruned.0.1._replaced_9.geno.table.012',header=F,row.names=1)
>     pos<-read.delim('wild_annuus.snps.fb.lenient.noTE_filt_DP15_GOOD_IND_ld.pruned.0.1.geno.table.012.pos',header=F)
>     indv<-read.delim('wild_annuus.snps.fb.lenient.noTE_filt_DP15_GOOD_IND_ld.pruned.0.1.geno.table.012.indv',header=F)
>     colnames(snps)<-paste(pos[,1],pos[,2],sep=':')
>     rownames(snps)<-indv[,1]
>     snps<-as.matrix(snps)
>     snps.convert<-t(snps)
>     
>     write.table(snps.convert, file= "snps.convert_all_pruned", col.names = FALSE, row.names = TRUE)
ADD REPLY
0
Entering edit mode

That's great! - good luck with the analysis

ADD REPLY

Login before adding your answer.

Traffic: 1925 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6