How to edit a VCF file (specifically the #CHROM and ID columns)
6.5 years ago
angelaparody ▴ 100

Hi,

I have a .vcf file and I need to edit the #CHROM and ID columns because they cause problems when I try to convert the file to other formats (specifically, the #CHROM column contains scaffold names and the ID column contains only "." for all the SNPs).

I have a tab-delimited .txt file containing both columns, which I want to use to replace those two columns in my VCF file. I tried several solutions posted on Biostars, but nothing has worked for me so far.

So, in short, how can I edit the #CHROM and ID columns of a .vcf file using a tab-delimited .txt file? Please answer in some detail, since I am pretty new to all this.

Thanks in advance,

Cheers,

Ángela


Please show us a few lines of your input, a few lines of your desired output, and the solutions you have tried.


"nothing worked for me so far"

Please be as specific as possible, e.g. include the error message or explain how the output you got differs from what you aim to obtain.

6.5 years ago
angelaparody ▴ 100

Hi,

Thanks for the answers.

I was writing up all the details when I found out how to make it work. Or rather, I have been able to set the IDs on the fly with:

bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' myfilename.vcf > new.vcf

myfilename.vcf was looking like this (first rows and first three columns):

#CHROM  POS ID
scaffold00002   764729  .
scaffold00002   764955  .
scaffold00002   765132  .
scaffold00002   766694  .
scaffold00002   766775  .
scaffold00002   766966  .
scaffold00002   771319  .
scaffold00002   773905  .
scaffold00002   775644  .
scaffold00002   776411  .
scaffold00007   1178023 .
scaffold00007   1178440 .
scaffold00007   1180956 .

and after using the command, the new.vcf file looks like this:

#CHROM  POS ID
scaffold00002   764729  scaffold00002_764729_T_C
scaffold00002   764955  scaffold00002_764955_C_G
scaffold00002   765132  scaffold00002_765132_G_A
scaffold00002   766694  scaffold00002_766694_C_G
scaffold00002   766775  scaffold00002_766775_G_A
scaffold00002   766966  scaffold00002_766966_G_A
scaffold00002   771319  scaffold00002_771319_C_G
scaffold00002   773905  scaffold00002_773905_A_G
scaffold00002   775644  scaffold00002_775644_T_C
scaffold00002   776411  scaffold00002_776411_A_T
scaffold00007   1178023 scaffold00007_1178023_C_T
scaffold00007   1178440 scaffold00007_1178440_A_G
scaffold00007   1180956 scaffold00007_1180956_C_T
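
For reference, the `+` in front of the format string tells bcftools to set only the IDs that are missing. If bcftools is not at hand, roughly the same edit can be sketched with awk. This is a swapped-in approach, not what the command above does internally; the function name `fill_missing_ids` is made up for illustration, and the sketch assumes an uncompressed, tab-delimited VCF.

```shell
# Hypothetical helper: fill missing IDs (".") with CHROM_POS_REF_FIRSTALT.
# Reads a VCF on stdin and writes the edited VCF to stdout.
fill_missing_ids() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }            # header lines pass through unchanged
       $3 == "." {
         split($5, alt, ",")           # ALT may list several alleles; keep the first
         $3 = $1 "_" $2 "_" $4 "_" alt[1]
       }
       { print }'
}
```

It could be called as `fill_missing_ids < myfilename.vcf > new.vcf`. IDs that are already set are left alone.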

So my problem is half solved. I think I won't need to edit the CHROM information, although I am not totally sure, since I read somewhere that it might be a problem for this column to contain letters. Anyway, if I need to edit the CHROM information I will ask the question again.

Thanks again,

Ángela


Use the code button (the one with 101010 on it in the formatting bar) to format your code so it is more readable.


Sorry about that, I am new here. From now on I will use it. Thanks

Ángela

6.5 years ago
angelaparody ▴ 100

Hi again,

It would actually be very helpful to be able to edit the #CHROM column as well. What I am after is to change this:

#CHROM  POS ID
scaffold00002   764729  scaffold00002_764729_T_C
scaffold00002   764955  scaffold00002_764955_C_G
scaffold00002   765132  scaffold00002_765132_G_A
scaffold00002   766694  scaffold00002_766694_C_G
scaffold00002   766775  scaffold00002_766775_G_A
scaffold00002   766966  scaffold00002_766966_G_A
scaffold00002   771319  scaffold00002_771319_C_G
scaffold00002   773905  scaffold00002_773905_A_G
scaffold00002   775644  scaffold00002_775644_T_C
scaffold00002   776411  scaffold00002_776411_A_T
scaffold00007   1178023 scaffold00007_1178023_C_T
scaffold00007   1178440 scaffold00007_1178440_A_G
scaffold00007   1180956 scaffold00007_1180956_C_T

to this:

#CHROM  POS ID
1   764729  scaffold00002_764729_T_C
1   764955  scaffold00002_764955_C_G
1   765132  scaffold00002_765132_G_A
1   766694  scaffold00002_766694_C_G
1   766775  scaffold00002_766775_G_A
1   766966  scaffold00002_766966_G_A
1   771319  scaffold00002_771319_C_G
1   773905  scaffold00002_773905_A_G
1   775644  scaffold00002_775644_T_C
1   776411  scaffold00002_776411_A_T
2   1178023 scaffold00007_1178023_C_T
2   1178440 scaffold00007_1178440_A_G
2   1180956 scaffold00007_1180956_C_T

Specifically, each scaffold would be replaced by a number. I have 126 scaffolds in total, so the numbers in the #CHROM column would run from 1 to 126. I don't think simply replacing the column would work, because a .vcf file has a header, but I am not sure (I am pretty new to all this).

I tried this:

## Remove header from txt file
tail -n+2 sub.txt > newpos.txt

## Get header from vcf
grep -P '^#' test.vcf > new.vcf

grep -v -P '^#' test.vcf \
| cut -f3- \
| paste newpos.txt - >> new.vcf

This was posted in "Replace fields CHROM and POS in a vcf file", but it didn't work for me: the new.vcf file produced by the "## Get header from vcf" step is empty. I suspect there is something wrong in the code, but as I don't fully understand this language I can't tell what it could be. This is what my terminal shows:

MacBook-Pro-de-Angela:EditingCHROMfieldVCF angelaparodymerino$ grep -P '^#' mac3_minDP3_maxmeanDP289_maf005_minQ40_minGQ30_hwe005_265ind_IDs2.vcf > new.vcf
usage: grep [-abcDEFGHhIiJLlmnOoPqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
        [-e pattern] [-f file] [--binary-files=value] [--color=when]
        [--context[=num]] [--directories=action] [--label] [--line-buffered]
        [--null] [pattern] [file ...]
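
The usage message suggests that the grep on this machine does not accept the -P (Perl regex) option, so the header step writes nothing to new.vcf. The pattern '^#' is an ordinary regular expression, so -P can simply be dropped. A minimal sketch of the same three steps, wrapped in a hypothetical function (the name `replace_first_two_columns` and the file names are placeholders, not part of the linked post), assuming newpos.txt has exactly one row per variant, in the same order as the VCF:

```shell
# Hypothetical wrapper around the steps above; file names are placeholders.
# Usage: replace_first_two_columns in.vcf newpos.txt > new.vcf
replace_first_two_columns() {
  vcf=$1      # uncompressed, tab-delimited VCF
  newpos=$2   # two tab-separated columns (CHROM, POS), header already removed

  # Header lines: a plain pattern is enough, no -P needed
  grep '^#' "$vcf"

  # Data lines: drop the old CHROM and POS, then put the new ones in front
  grep -v '^#' "$vcf" \
    | cut -f3- \
    | paste "$newpos" -
}
```

Because paste joins the two inputs line by line, a missing or extra row in newpos.txt would silently shift every later variant, so comparing the line counts first is worthwhile.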

In short, does anyone know how I can edit the #CHROM column of a .vcf file using a .txt file?

Thanks in advance,

Regards,

Ángela Parody-Merino


Output:

$ sed -n 's/^[a-z]*0*//p' test

    #CHROM  POS ID
    2   764729  scaffold00002_764729_T_C
    2   764955  scaffold00002_764955_C_G
    2   765132  scaffold00002_765132_G_A
    2   766694  scaffold00002_766694_C_G
    2   766775  scaffold00002_766775_G_A
    2   766966  scaffold00002_766966_G_A
    2   771319  scaffold00002_771319_C_G
    2   773905  scaffold00002_773905_A_G
    2   775644  scaffold00002_775644_T_C
    2   776411  scaffold00002_776411_A_T
    7   1178023 scaffold00007_1178023_C_T
    7   1178440 scaffold00007_1178440_A_G
    7   1180956 scaffold00007_1180956_C_T

Input:

$ cat test
#CHROM  POS ID
scaffold00002   764729  scaffold00002_764729_T_C
scaffold00002   764955  scaffold00002_764955_C_G
scaffold00002   765132  scaffold00002_765132_G_A
scaffold00002   766694  scaffold00002_766694_C_G
scaffold00002   766775  scaffold00002_766775_G_A
scaffold00002   766966  scaffold00002_766966_G_A
scaffold00002   771319  scaffold00002_771319_C_G
scaffold00002   773905  scaffold00002_773905_A_G
scaffold00002   775644  scaffold00002_775644_T_C
scaffold00002   776411  scaffold00002_776411_A_T
scaffold00007   1178023 scaffold00007_1178023_C_T
scaffold00007   1178440 scaffold00007_1178440_A_G
scaffold00007   1180956 scaffold00007_1180956_C_T
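
Note that the sed command keeps each scaffold's own number (2 and 7 here) rather than renumbering from 1, and on a full VCF it leaves the ## header lines as they are. If consecutive numbers (1 to 126) or a tab-delimited mapping file are wanted, a minimal awk sketch could look like this. The function names `make_chrom_map` and `apply_chrom_map` are hypothetical, and the sketch assumes an uncompressed, tab-delimited VCF.

```shell
# Hypothetical helpers; the names are illustrative only.

# Print "old<TAB>new" pairs, numbering scaffolds 1, 2, 3, ... in order of
# first appearance among the data lines of a VCF read from stdin.
make_chrom_map() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { next }
       !($1 in seen) { seen[$1] = ++n; print $1, n }'
}

# Rewrite column 1 of the data lines using a two-column map file.
# Header lines are left untouched, so any ##contig lines keep the old names.
# Usage: apply_chrom_map map.txt < in.vcf > out.vcf
apply_chrom_map() {
  awk 'BEGIN { FS = OFS = "\t" }
       NR == FNR { map[$1] = $2; next }     # first file: load the map
       /^#/ { print; next }                 # VCF header lines
       ($1 in map) { $1 = map[$1] }         # names missing from the map stay as-is
       { print }' "$1" -
}
```

Usage might look like `make_chrom_map < in.vcf > map.txt` followed by `apply_chrom_map map.txt < in.vcf > out.vcf`. bcftools annotate also has a `--rename-chrs` option that takes a file of old-name/new-name pairs, which may be preferable when the header contains ##contig lines.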

Thanks so much! It worked! :)

Regards,

Ángela Parody-Merino
