Question

Changing sample name in multiple VCF files

2.9 years ago

USA_225478 • 0

Hi everyone,

This is more of a scripting question but hopefully someone can help.

I have used GATK's HaplotypeCaller to call SNPs for 150 samples, and Picard's GatherVcfs to merge each sample into a single GVCF file. I now want to import the 150 merged GVCFs into GenomicsDB to perform joint genotyping.

Somewhere along the way every sample has been renamed 'Sample1' and so GenomicsDBImport is throwing out a duplicated samples error.

Does anyone know an efficient way of replacing the sample names in all 150 files? I thought about doing a nested loop in Linux something like:

for F in $(cat $fileList)  
do
    for G in $(cat $newNames)  
    do
        bcftools reheader ${F}.g.vcf.gz -s $G 
    done
done

But

a) I'm not sure if I have that loop set up correctly

b) I'm getting more confused by bcftools requiring a file and not a string as input. Would I need to create 150 files, each containing a single name, and then provide a list of those files within newNames?

Any help would be much appreciated!

gatk bcftools linux vcf • 2.1k views

ADD COMMENT • link updated 2.9 years ago by Kevin Blighe 87k • written 2.9 years ago by USA_225478 • 0

score 0 · Answer 1 · 2021-05-25

2.9 years ago

Kevin Blighe 87k

You could try this, assuming that your VCFs are in a directory called vcfs/:

find vcfs/ -name "*.vcf.gz" | while IFS= read -r vcf ;
do
  echo -e "--input file is:\t${vcf}" ;
  # build the output name by swapping the .vcf.gz suffix
  out="${vcf%.vcf.gz}_reheader.vcf.gz" ;
  echo -e "--output file is:\t${out}" ;
  # bcftools reheader has no -O option; the output keeps the input's compression
  bcftools reheader \
    --samples id_lookup.txt \
    --output "${out}" \
    "${vcf}" ;
  echo "Done." ;
done ;

id_lookup.txt looks like (space-delimited):

oldname1 newname1
oldname2 newname2
oldname3 newname3
oldname4 newname4

Kevin

ADD COMMENT • link 2.9 years ago by Kevin Blighe 87k
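
After reheadering, it is worth confirming which sample name each output file actually carries. bcftools query -l prints the names; the sketch below reads the #CHROM header line with standard tools only, in case bcftools is not at hand. The function name vcf_samples is hypothetical, and the _reheader.vcf.gz suffix in the usage comment is an assumption taken from the loop above.

```shell
# Hypothetical helper: list the sample names stored in a VCF header.
# Works on plain, gzip- or bgzip-compressed VCFs; sample names are
# columns 10 onwards of the tab-separated #CHROM line.
vcf_samples() {
  gzip -cdf "$1" | awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Example usage (assumes reheadered files exist under vcfs/):
# for f in vcfs/*_reheader.vcf.gz; do
#   printf '%s\t%s\n' "$f" "$(vcf_samples "$f")"
# done
```

If every file still reports the same name, the lookup file is being applied identically to each VCF rather than row by row.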

EDIT: Oops, I just realised it's given the last name in id_lookup.txt to each sample in the subset I tested it on, any ideas?? :/

ADD REPLY • link 2.9 years ago by USA_225478 • 0

Perhaps paste some of the IDs that you have? If there are special characters, this can cause a problem.

ADD REPLY • link 2.9 years ago by Kevin Blighe 87k

Hi Kevin, thanks for getting back to me. The IDs are just alphanumeric, no symbols, e.g. P0811Y23, P0811Y24, etc.

ADD REPLY • link 2.9 years ago by USA_225478 • 0

There may be an issue with line-end encodings in your samples file - not sure. I cannot see how the names appear in your VCF or samples file. If I recall, the delimiter is a space. Also, keep in mind that we are referring to a multi-sample single VCF file, right? You are saying that, e.g., if you supplied my file above, it would convert everything to newname4?

ADD REPLY • link 2.9 years ago by Kevin Blighe 87k
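
One way to test the line-ending theory raised above is to look for carriage returns in the samples file: a file saved with Windows line endings terminates each line with \r\n, and the stray \r then becomes part of the sample name. A minimal sketch, assuming the file is called id_lookup.txt as in the answer; the helper names has_crlf and strip_cr are made up for illustration.

```shell
# Hypothetical helpers, not part of bcftools.

# Succeeds (exit status 0) if the file contains at least one carriage return.
has_crlf() {
  grep -q "$(printf '\r')" "$1"
}

# Write a copy of the file ($1) with all carriage returns removed to $2.
strip_cr() {
  tr -d '\r' < "$1" > "$2"
}

# Example usage (assumes id_lookup.txt exists in the current directory):
# if has_crlf id_lookup.txt; then
#   strip_cr id_lookup.txt id_lookup.unix.txt
# fi
```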

Ah no, this is where the confusion lies! I have 150 single-sample VCF files and each one has the same sample name. So I want to reheader the nth VCF file with the nth row from id_lookup.txt. This is why I think I’d need a nested loop, but I can’t get it to work :/

ADD REPLY • link 2.9 years ago by USA_225478 • 0

In that case, you could still use a single loop, and your input [to the loop] would be the list of VCFs plus the list of new sample names, and both of these would be perfectly aligned by row. Here is a sketch, assuming single-sample VCFs whose names end in .vcf.gz:

paste VCFs.list newnames.list | while read -r vcf newname ;
do
  # extract the old name from the input VCF (see https://www.biostars.org/p/139362/)
  oldname=$(bcftools query -l "${vcf}") ;
  # write old name and new name to a space-delimited temporary file, rename.txt.tmp
  echo "${oldname} ${newname}" > rename.txt.tmp ;
  # rename via bcftools reheader -s rename.txt.tmp (the output file name is an assumption)
  bcftools reheader -s rename.txt.tmp -o "${vcf%.vcf.gz}_reheader.vcf.gz" "${vcf}" ;
done ;
rm rename.txt.tmp ;

ADD REPLY • link 2.9 years ago by Kevin Blighe 87k
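
Because paste pairs the two lists purely by row, a missing line or a repeated name goes unnoticed, and a repeated name would bring back the duplicate-sample error in GenomicsDBImport. A small pre-flight check can catch both, sketched here under the assumption that the lists are called VCFs.list and newnames.list as above; check_lists is a made-up helper name.

```shell
# Hypothetical pre-flight check, not part of bcftools or GATK.
# Usage: check_lists VCFs.list newnames.list
check_lists() {
  vcf_list=$1
  name_list=$2
  # Arithmetic expansion strips any padding that wc adds on some systems.
  n_vcfs=$(( $(wc -l < "$vcf_list") ))
  n_names=$(( $(wc -l < "$name_list") ))
  if [ "$n_vcfs" -ne "$n_names" ]; then
    echo "line count mismatch: $n_vcfs VCFs vs $n_names names" >&2
    return 1
  fi
  # uniq -d prints each name that occurs more than once in the sorted list.
  dups=$(sort "$name_list" | uniq -d)
  if [ -n "$dups" ]; then
    echo "duplicated new names: $dups" >&2
    return 1
  fi
  echo "OK: $n_vcfs VCFs, $n_names unique names"
}

# Example usage (assumes both list files exist):
# check_lists VCFs.list newnames.list && echo "safe to run the rename loop"
```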