Biostar Beta. Not for public use.
PLINK assoc input: can it be two vcf files case/control?
0
Entering edit mode
2.1 years ago
Pin.Bioinf • 240
@Pin.Bioinf38957

Hello, I want to do an association test with plink.

I have two vcf files: one has all common variants for case population, the other has all common variants for control population. I want to feed these vcfs to plink. I don't understand the input plink needs for --assoc command. Is there any way I can do this? Or how can I reformat the vcf files to feed them to plink --assoc?

Thank you!

PLINK vcf snps association • 334 views
2
Entering edit mode
2.1 years ago
@Kevin Blighe41557

Merge the VCFs using bcftools merge and then input to PLINK using --vcf or --bcf. When inputting, I highly recommend using the flag --indiv-sort file so that you control exactly the order of the samples. I go over an example of this, here: A: Different samples order in .ped and phynotype data files

Without the use of the sort file, PLINK will highly likely input the samples in a different order than they appear in the VCF / BCF

The order in your sort file should then match the custom FAM file that you also create, which you specify for use when you run --assoc. I go over this, here: A: linkage disequilibrium analysis

So an entire pipeline would look like:

bcftools merge ... ... > MyVars.bcf

plink --noweb --bcf MyVars.bcf --keep-allele-order --indiv-sort file MySampleListing.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out MyVars ;

plink --noweb --assoc --ci 0.95 --counts --bfile MyVars --assoc --fam MyVars.Custom.fam ;

0
Entering edit mode

Thanks a lot. Am I correct to think MySampleListing.list should be like:

0 Sample1
0 Sample2
..
..
0 Sample15
..
0 Sample30


?

And how about the FAM file? ( only have phenotype (condition) information, but I put it in the FAMILY ID

Family ID   Individual ID   Paternal ID Maternal ID Gender  Phenotype   Population  Relationship    Siblings    Second Order    Third Order Other Comments
Post    Sample1 0   0   0   0   0   0   0   0   0   0
Post    Sample2 0   0   0   0   0   0   0   0   0   0
Post    Sample3 0   0   0   0   0   0   0   0   0   0
Post    ... 0   0   0   0   0   0   0   0   0   0
Mx  ... 0   0   0   0   0   0   0   0   0   0
Mx  ... 0   0   0   0   0   0   0   0   0   0
Mx  ... 0   0   0   0   0   0   0   0   0   0
Mx  Sample30    0   0   0   0   0   0   0   0   0   0

1
Entering edit mode

Your sort file (MySampleListing.list) looks okay. Those IDs should all be in your VCF. If not, a warning or error will be thrown (I believe).

Your FAM has too much information. The minimal needed is:

• Family ID (FID)
• Individual ID (IID)
• Paternal ID (PID)
• Maternal ID (MID)
• Gender (1, male; 2, female)
• Phenotype/Disease status (1, control; 2, case/disease)

Most of these can be left as -9 or 0. An association test just looks at Phenotype, and uses the IID to match up to your PLINK dataset.

When you input the data to PLINK, you can double check the sample order of the new dataset by outputting the new dataset again and looking at the header of the output PED file, e.g., --recode transpose

0
Entering edit mode

I get the error in which says 'Error: --indiv-sort file does not contain all loaded sample IDs.' here is what my vcf looks like:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample1.bam Sample2.bam Sample3.bam Sample4.bam Sample5.bam Sample20.bam... Sample25.bam ...    Sample30.bam


Also, after I removed the option 'indiv-sort' just to check what happened, I kept on with the pipeline, and in the last command for the association analysis I used this .fam:

0   Sample1.bam 0   0   0   1
0   Sample2.bam 0   0   0   1
0   ....bam 0   0   0   1
0   Sample25.bam    0   0   0   1
0   ....bam 0   0   0   0   2
0   Sample26.bam    0   0   0   0   2
0   ....bam 0   0   0   0   2
0   Sample30.bam    0   0   0   0   2


(30 lines one for each sample in the vcf)

I got this message: "No phenotypes present" "Skipping --assoc/--model since less than two phenotypes are present."

1
Entering edit mode

For good reason, there is strict enforcing in the matching of the samples in the VCF with those in the sort file specified with --indiv-sort file. Also, I believe the sort file should be tab-delimited. If you only have 30 samples, you should be able to resolve the discrepancy quite quickly.

Alternatively, you can indeed skip the --indiv-sort file part and proceed from there. In this case, it is absolutely essential that your FAM file sample order is the exact same as the sample order in your PLINK dataset. Do not make any assumptions in this regard, because PLINK makes no checks.