Question

Issue with convertf to EIGENSTRAT on (relatively) large datasets


4.4 years ago

mc617 • 0

Hi all, I was wondering whether anyone has encountered problems converting PLINK files of 1000 Genomes Project samples plus some additional samples to EIGENSTRAT or PACKEDPED format.

The files I am using have 2,281,905 markers and 1686 samples. I would like to run smartpca using a subset as the reference (lsqproject). However, it seems that ped/map files are not the correct input for smartpca, and convertf does not complete the conversion to either EIGENSTRAT or PACKEDPED format.

Any suggestions? I thought it might be a memory or dataset-size issue (there is a limit on dataset size somewhere between 2 billion and 8 billion genotypes, right?), but that should not be the problem in this case. On a smaller dataset, it worked perfectly.

Thank you!
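As a quick sanity check on the dataset size described in the question, the total genotype count is simply markers times samples. The sketch below only does that arithmetic and compares the result with the 2 billion and 8 billion figures mentioned in the post; those two thresholds are taken from the question as stated and are not verified against the convertf documentation.

```python
# Genotype count for the dataset described in the question.
n_markers = 2_281_905
n_samples = 1_686

n_genotypes = n_markers * n_samples

# Thresholds as quoted in the question (assumption: not checked against
# the convertf documentation).
LOWER = 2_000_000_000
UPPER = 8_000_000_000

print(f"{n_genotypes:,} genotypes")
print("above 2 billion:", n_genotypes > LOWER)
print("below 8 billion:", n_genotypes < UPPER)
```

By this count the dataset holds roughly 3.85 billion genotypes, which sits between the two figures, so a size limit may still be worth ruling out.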

smartpca convertf eigenstrat 1000 Genomes Project • 1.9k views

ADD COMMENT • link 4.4 years ago by mc617 • 0


Is there a reason why plink --pca isn't sufficient for your use case?

ADD REPLY • link 4.4 years ago by chrchang523 10k


Hi, thanks for your question! It was suggested to me that with smartpca I could use one set as the reference and then, by specifying lsqproject, project my samples, which have much more missingness than the samples I'd like to use as the reference. How would plink --pca handle this? Thanks

ADD REPLY • link 4.4 years ago by mc617 • 0


https://www.cog-genomics.org/plink/1.9/strat#pca

"If clusters are defined (via --within), you can base the principal components off a subset of samples and then project everyone else onto those PCs with --pca-cluster-names and/or --pca-clusters. --pca-cluster-names accepts a space-delimited sequence of cluster names on the command line, while --pca-clusters takes the name of a file with one cluster name per line. If you also want the MAFs used in the relationship matrix calculation to be based on only samples in those clusters, dump those MAFs in a separate run with --freq[x] + --keep-cluster-names/--keep-clusters, and then load them during your PCA run with --read-freq."
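A minimal sketch of the projection setup described in the quoted documentation, assuming a binary fileset with the hypothetical prefix `merged`, a hypothetical cluster label `REF` for the reference samples, and made-up sample IDs. It writes a `--within` file (one `FID IID CLUSTER` row per sample) and assembles the plink 1.9 command without running it.

```python
import shlex


def write_within_file(path, samples):
    """Write a PLINK --within file: one 'FID IID CLUSTER' row per sample."""
    with open(path, "w") as fh:
        for fid, iid, cluster in samples:
            fh.write(f"{fid} {iid} {cluster}\n")


def pca_projection_cmd(bfile, within_path, ref_clusters, out_prefix):
    """Build a plink 1.9 command that bases the PCs on ref_clusters
    and projects every other sample onto them."""
    return [
        "plink", "--bfile", bfile,
        "--within", within_path,
        "--pca",
        "--pca-cluster-names", *ref_clusters,
        "--out", out_prefix,
    ]


# Hypothetical sample IDs and cluster labels, for illustration only.
samples = [
    ("FAM1", "REF001", "REF"),
    ("FAM2", "REF002", "REF"),
    ("FAM3", "NEW001", "STUDY"),
]
write_within_file("clusters.txt", samples)

cmd = pca_projection_cmd("merged", "clusters.txt", ["REF"], "merged_pca")
print(shlex.join(cmd))
```

The printed command can be pasted into a shell once the file names match your data; with text ped/map input, `--file` would take the place of `--bfile`.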

ADD REPLY • link 4.4 years ago by chrchang523 10k


Thank you so much, that is great! Apologies, I did not notice it before.

I am probably making a silly mistake again, but --read-freq does not accept .frq.strat files (from --freq + --keep-clusters); if I trim the columns so that the file looks like a .frq file, the output of --pca is empty. Thanks

ADD REPLY • link 4.4 years ago by mc617 • 0


Oops, the silly mistake is actually on my end; the documentation needs to be tweaked a bit to work around the fact that --read-freq doesn't actually work on .frq.strat files. I'll take care of that tomorrow.

In the meantime, replace --keep-clusters with --keep on a file containing all the original-PCA sample IDs (and don't use --within on that run).
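The workaround might be sketched as two plink runs, assembled here as argument lists rather than executed. All file names, the `REF` cluster label, and the sample IDs are hypothetical, and the `--within` file used in the second run is assumed to exist already.

```python
import shlex


def write_keep_file(path, ref_ids):
    """Write a PLINK --keep file: one 'FID IID' row per reference sample."""
    with open(path, "w") as fh:
        for fid, iid in ref_ids:
            fh.write(f"{fid} {iid}\n")


def freq_cmd(bfile, keep_path, out_prefix):
    """Run 1: allele frequencies from the reference samples only.
    No --within here, so the report is a plain .frq file."""
    return ["plink", "--bfile", bfile, "--keep", keep_path,
            "--freq", "--out", out_prefix]


def pca_cmd(bfile, within_path, ref_clusters, freq_path, out_prefix):
    """Run 2: PCA based on the reference clusters, loading the
    frequencies from run 1 via --read-freq."""
    return ["plink", "--bfile", bfile, "--within", within_path,
            "--read-freq", freq_path, "--pca",
            "--pca-cluster-names", *ref_clusters, "--out", out_prefix]


# Hypothetical reference sample IDs, for illustration only.
write_keep_file("ref_samples.txt", [("FAM1", "REF001"), ("FAM2", "REF002")])

run1 = freq_cmd("merged", "ref_samples.txt", "ref_freq")
run2 = pca_cmd("merged", "clusters.txt", ["REF"], "ref_freq.frq", "merged_pca")
print(shlex.join(run1))
print(shlex.join(run2))
```

The first run should leave `ref_freq.frq` on disk, which the second run then reads; the sample list in `ref_samples.txt` is meant to match the samples in the reference clusters.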