Biostar Beta. Not for public use.
Question: How to extract filename and change text in the same file
1

Hello,

I have about 30 VCF files with names like ID_001.new.vcf. I want to extract only the "ID_001" part of each file name and substitute it for "Sample1" in the header line of that VCF file:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1

so that the result looks like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ID_001

How can I do this? I tried using echo in bash to extract the IDs from the file names, but I could not work out how to loop over the files and make the change inside each one. Thanks for your help.

ADD COMMENTlink 12 months ago Inquisitive8995 • 100 • updated 12 months ago Malcolm.Cook ♦ 1.0k
0
  1. Extract sample names from VCF using bcftools (query -l)
  2. Prepare a new file with sample names (new names) one per line in the order of sample names from point 1
  3. Use bcftools reheader option to change the sample names from point 2.

Take a back up of original file before proceeding.
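
The three steps above could be sketched roughly as follows. This is only an illustration: it assumes bcftools is on the PATH, that each file holds exactly one sample, and that writing to a new file with a made-up ".renamed.vcf" suffix is acceptable. The function name rename_sample is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: rename the single sample in a VCF to the ID taken from its file name.
# Assumptions: bcftools is installed; one sample per file; the output suffix
# ".renamed.vcf" is an arbitrary choice for this example.

rename_sample() {
    local vcf=$1
    local id names out
    id=$(basename "$vcf" .new.vcf)        # ID_001.new.vcf -> ID_001
    out="${vcf%.new.vcf}.renamed.vcf"     # write next to the input file
    names=$(mktemp)                       # temporary file of new sample names

    bcftools query -l "$vcf"              # step 1: list the current sample name(s)
    printf '%s\n' "$id" > "$names"        # step 2: one new name per line
    bcftools reheader -s "$names" -o "$out" "$vcf"   # step 3: rewrite the header

    rm -f "$names"
}

# Usage (the original files are left untouched):
#   for f in *.new.vcf; do rename_sample "$f"; done
```

Because the result goes to a separate file, the original serves as the backup.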

ADD REPLYlink 12 months ago
cpad0112
11k
2

In bash, this should do it:

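for i in *.new.vcf
do
        # Strip the .new.vcf suffix: ID_001.new.vcf -> ID_001
        ID_NAME=$(basename "$i" .new.vcf)
        # Replace Sample1 on line 1 only; -i edits the file in place
        sed -i "1s|Sample1|$ID_NAME|g" "$i"
done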

Caution: I have used -i with sed. So the actual files will get edited in place.

I have now also added 1s to limit the replacement to the first line only.

ADD COMMENTlink 12 months ago Jeffin Rockey ♦ 1.1k
2

I think it would be better to use 'bcftools view --samples-file' than sed.

ADD REPLYlink 12 months ago
Pierre Lindenbaum
120k
0

Hi Pierre, I did not understand. Would bcftools view do any replacement?

ADD REPLYlink 12 months ago
Jeffin Rockey
♦ 1.1k
0

The --samples-file option can be used to rename the samples. From https://samtools.github.io/bcftools/bcftools.html:

This file can also be used to rename samples by giving the new sample name as a second white-space-separated column, like this: "old_name new_name".

ADD REPLYlink 12 months ago
Pierre Lindenbaum
120k
0

This works only if every file has Sample1 in its header line. Will that be the case?

ADD REPLYlink 12 months ago
Vijay Lakhujani
4.1k
0

Yes, all files have Sample1.

ADD REPLYlink 12 months ago
Inquisitive8995
• 100
0

@Jeffin, thanks for your response. The header line is not the first line of the file. How can I change the sed command so that it finds the line containing Sample1 and replaces it with $ID_NAME?

ADD REPLYlink 12 months ago
Inquisitive8995
• 100
1

Changing 1s| to simply s| will do replacements for all Sample1 occurrences.
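
If replacing every occurrence is too broad, one alternative is to address only the column-header line. The sketch below assumes that line starts with #CHROM and that Sample1 is the last column, as in the question. The function name rename_header_sample is made up for this example, and -i.bak keeps a backup copy of each file.

```shell
#!/usr/bin/env bash
# Sketch: replace Sample1 only on the #CHROM header line, wherever it occurs.
# Assumptions: the ID contains no "|" or "&" characters (true for ID_001-style
# names); a .bak copy of each file is kept by -i.bak.

rename_header_sample() {
    local vcf=$1
    local id
    id=$(basename "$vcf" .new.vcf)      # ID_001.new.vcf -> ID_001
    # /^#CHROM/ selects the column-header line; Sample1$ matches only at line end.
    sed -i.bak "/^#CHROM/s|Sample1\$|${id}|" "$vcf"
}

# Usage:
#   for f in *.new.vcf; do rename_header_sample "$f"; done
```

Any "Sample1" text in the ## metadata lines or in the records is left alone.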

ADD REPLYlink 12 months ago
Jeffin Rockey
♦ 1.1k
0

Thanks a lot. This worked!

ADD REPLYlink 12 months ago
Inquisitive8995
• 100
3

If you have GNU parallel installed, you can use it instead of a bash for loop:

parallel 'sed -i "s|Sample1$|{=s/.new.vcf$//=}|"' {} ::: *.new.vcf
ADD COMMENTlink 11 months ago Malcolm.Cook ♦ 1.0k
0

Hi Malcolm, the suggested command appears to be super efficient, though I did not understand much of the syntax. Can you please explain {=s/.new.vc$f//=}, {}, ::: etc.?

ADD REPLYlink 11 months ago
Jeffin Rockey
♦ 1.1k
1

Sure.

In general, in your command line:

  • {} gets replaced with the file being processed.
  • {=perl expression=} gets replaced with the value of a perl expression being evaluated in the context of the perl variable $_ being set to the name of the file being processed.

So, in my example, we are using sed to replace the word "Sample1" appearing at the end of a line with the result of removing the trailing .new.vcf from each filename.

Documentation for this can be found in parallel's manpage by searching for "{=perl expression=}", where you can also read:

::: arguments
Use arguments from the command line as input source instead of stdin (standard input). 
ADD REPLYlink 11 months ago
Malcolm.Cook
♦ 1.0k
0

Fix: vc$f -> vcf\$.

Also try: parallel --plus 'sed -i "s|Sample1$|{%.new.vcf}|"' {} ::: *.new.vcf
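
For readers more familiar with bash than with parallel: the value that {%.new.vcf} (or the {=s/.new.vcf$//=} Perl expression) is meant to produce for each file can be illustrated with bash's own suffix-removal expansion. This is a plain-bash analogy, not parallel itself, and the helper name strip_suffix is invented for the example.

```shell
#!/usr/bin/env bash
# Plain-bash illustration of the string that replaces Sample1 for each input
# file: the file name with a trailing ".new.vcf" removed.

strip_suffix() {
    local file=$1
    printf '%s\n' "${file%.new.vcf}"   # drop a trailing ".new.vcf" if present
}

strip_suffix ID_001.new.vcf    # prints ID_001
strip_suffix ID_002.new.vcf    # prints ID_002
```

One small difference: in the Perl version the unescaped dots match any character, while the bash expansion here matches the literal suffix only.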

ADD REPLYlink 11 months ago
ole.tange
♦ 3.4k
0

Hi Ole,

Could you please point me to a link or other resource that would help me understand {}, ::: etc.?

ADD REPLYlink 11 months ago
Jeffin Rockey
♦ 1.1k
0

It is covered in GNU Parallel 2018 chapter 5 (Online https://doi.org/10.5281/zenodo.1146014, printed www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html)

ADD REPLYlink 11 months ago
ole.tange
♦ 3.4k
0

Thanks for the fix and the alternative!

ADD REPLYlink 11 months ago
Malcolm.Cook
♦ 1.0k
