Question

Removing duplicated mutations from a txt

1

Entering edit mode

5.2 years ago

zizigolu ★ 4.3k

Hi,

I have parsed my vcf files containing SNPs as below

CHROMOSOME  POSITION    REF ALT SAMPLE
1   782112  G   A   LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1   1026918 C   T   LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1   1133283 C   T   LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1   1431511 G   A   LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1   1742395 C   T   LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1   1864994 G   A   LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1   1914766 C   T   LP6005334-DNA_H01_vs_LP6005333-DNA_H01

But I have duplicate mutation because for example in this sample

~$ grep 152280536 file.txt 
 1    152280536    T    C    LP6008334-DNA_C02_vs_LP6008335-DNA_G01
 1    152280536    T    C    LP6008334-DNA_C02_vs_LP6008335-DNA_G01

I am not sure in which step of data processing and how I could removing duplicated mutations.

Any help please

WGS VCF mutation dndscv • 2.4k views

ADD COMMENT • link updated 5.2 years ago by zx8754 11k • written 5.2 years ago by zizigolu ★ 4.3k

1

Entering edit mode

There are a million ways to do this that are a simple google away:

"unix remove duplicated lines"

e.g.

https://unix.stackexchange.com/questions/30173/how-to-remove-duplicate-lines-inside-a-text-file

ADD REPLY • link 5.2 years ago by Joe 21k

1

Entering edit mode

Use bash uniq:

sort -r file.txt | uniq > output.txt

ADD REPLY • link 5.2 years ago by ahaswer ▴ 150

1

Entering edit mode

sort | uniq is not really necessary when sort -u is an option, I think.

ADD REPLY • link 5.2 years ago by Ram 43k

0

Entering edit mode

Sorry, but some of these solutions destroyed my file

For example

by

 sort -r file.txt | uniq > output.txt

R says that

 2178 (76%) mutations have a wrong reference base. Please confirm that you are not running data from a different assembly or species.

or the command flags non duplicates too

(py3venv) [fi1d18@cyan01 ~]$ awk '!a[$3]++' fil.txt
CHROMOSOME      POSITION        REF     ALT     SAMPLE
1       782112  G       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1       1026918 C       T       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1       2753944 A       G       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
1       2856299 T       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
(py3venv) [fi1d18@cyan01 ~]$ awk '!a[$1]++' fil.txt
CHROMOSOME      POSITION        REF     ALT     SAMPLE
1       782112  G       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
10      611690  G       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
11      445671  C       T       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
12      243344  T       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
13      19025167        A       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
14      19075451        T       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
15      20014428        A       T       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
16      442771  G       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
17      45324   T       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
18      85012   G       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
19      709913  G       T       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
2       404136  T       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
20      215046  C       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
21      9651465 T       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
22      16873416        A       G       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
3       160445  G       T       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
4       207726  G       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
5       213846  C       T       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
6       300572  A       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
7       324430  G       A       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
8       187201  T       C       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
9       846272  A       G       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
X       419002  C       T       LP6005334-DNA_H01_vs_LP6005333-DNA_H01
Y       5531500 G       A       LP6007600_vs_LP6007599
(py3venv) [fi1d18@cyan01 ~]$

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Invest some time in understanding what the options you’ve been given actually do. Don’t just blindly copy from the web.

ADD REPLY • link 5.2 years ago by Joe 21k

0

Entering edit mode

sort has a lot of capabilities when used well. You might wish to set sorting keys (columns to sort with) using the -k M,N option. You can sort by columns that will be the same among identical rows and then use the -u option to pick only unique lines.

ADD REPLY • link 5.2 years ago by Ram 43k

0

Entering edit mode

If you read in the file as a data.frame or data.table into R (!), you can use unique(my_dataframe), no need to sort, if I recall correctly (but it may take some time).

Which function of R returns the error message; it's somewhat difficult to believe it's related to the sorting and uniq'ing though.

ADD REPLY • link 5.2 years ago by Friederike 8.9k

1

Entering edit mode

You are right

I was using dndscv r package by data under unique or any suggested command for removing duplication that returned error

> dndsout = dndscv(mut)
[1] Loading the environment...
[2] Annotating the mutations...
    Note: 3800 mutations removed for exceeding the limit of mutations per gene per sample
Error in dndscv(mut) : 
  2178 (76%) mutations have a wrong reference base. Please confirm that you are not running data from a different assembly or species.

I have also tried this likely amended data by another python packages like OncodriveCLUSTL and OncodriveFML and returned error like

position = int(line['POSITION'])
ValueError: invalid literal for int() with base 10: 'POSITION'

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

0

Entering edit mode

I do not see the usage of unique in your example.

Does that mean the error you're reporting ("X mutations have a wrong reference base") is independent of (any) of the unique commands? Did you check the help of dndscv()?

ADD REPLY • link 5.2 years ago by Friederike 8.9k

0

Entering edit mode

Everything happened when I tried to remove duplicates; I run these package with original file successfully

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

score 1 · Answer 1 · 2019-02-18

1

Entering edit mode

5.2 years ago

Friederike 8.9k

then:

read in the original file into R
use unique(df)
use that dndscv function

and show the entire code and error messages here.

ADD COMMENT • link 5.2 years ago by Friederike 8.9k

0

Entering edit mode

Sorry, after using unique(df) software worked, thank you

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

score 1 · Answer 2 · 2019-02-19

1

Entering edit mode

5.2 years ago

zx8754 11k

There error message from dndscv is not about unique issue. Do not remove duplicates, dndscv keeps only one mutation per gene, regardless of number of sample numbers.

The error is about the genome assembly/species, i.e., positions and ref/alts do not match with dndscv database. Because it is not the same genome, dndscv is discovering more mutations than expected: "exceeding the limit of mutations per gene"

From manuals, ensure you are using "GRCh37/hg19 human genome":

By default, dNdScv assumes that mutation data is mapped to the GRCh37/hg19 assembly of the human genome. Users interested in trying dNdScv on a different set of transcripts, a different assembly or a different species can follow this tutorial.

ADD COMMENT • link 5.2 years ago by zx8754 11k

3

Entering edit mode

Hello. Just a small clarification regarding the sentence:

Do not remove duplicates, dndscv keeps only one mutation per gene, regardless of number of sample numbers

dNdScv does not only keep one mutation per gene. By default, the dndscv function keeps up to 3 mutations per gene per sample. This can be changed using the optional parameter "max_muts_per_gene_per_sample" (see: ? dndscv). The reason for this is to protect dndscv against an inadequate annotation of clusters of mutations by the user. For example, imagine that a simple kataegis event introduces 10 mutations in a gene in a given sample. This may be look like high mutation recurrence in a gene, while in reality this is just a single event. Such events should not be annotated as independent mutations but as a complex mutation, but the "max_muts_per_gene_per_sample=3" by default protects the method against misannotation.

So, dndscv does treat each row in the mutation file as an independent mutation, and if you have duplicate entries in the mutation file, these mutations will be double-counted, which will affect the statistics of the model. That is why dndscv issues a warning when it detects multiple identical mutations. Using unique(df) is a good idea if the dataset contains duplicated rows.

ADD REPLY • link 5.2 years ago by Inigo Martincorena ▴ 220

0

Entering edit mode

Thanks a lot

I used unique(mutations)

Then

> dndsout = dndscv(mutations)
[1] Loading the environment...
[2] Annotating the mutations...
    Note: 6 mutations removed for exceeding the limit of mutations per gene per sample
[3] Estimating global rates...
[4] Running dNdSloc...
[5] Running dNdScv...
    Regression model for substitutions (theta = 3.23).
    Regression model for indels (theta = 0.123)
Warning messages:
1: In dndscv(total) :
  Mutations observed in contiguous sites within a sample. Please annotate or remove dinucleotide or complex substitutions for best results.
2: In dndscv(total) :
  Same mutations observed in different sampleIDs. Please verify that these are independent events and remove duplicates otherwise.
> 
>

I know I have duplicated mutations in same samples, as I am too bad in coding I just hope using unique in R removes duplicates correctly

However, I am not sure how to avoid this warning

Mutations observed in contiguous sites within a sample. Please annotate or remove dinucleotide or complex substitutions for best results.

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

1

Entering edit mode

unique() will only remove lines that are absolutely identical. The warning message indicates that the putative duplicated entries are not entirely identical because they contain different sample ID entries.

I don't know what your data.frame looks like, but if you are sure you want to remove the presumed duplicates (1), you might be able to simply assign the sample ID to the rownames of the data.frame and subsequently ignore the sample ID column from the unique call:

> df <- data.frame(sample = c("A","B","C"), value = c(10,10,100), mutation = c("x","x","y") )
> df
 sample value mutation
1      A    10        x
2      B    10        x
3      C   100        y
> unique(df) ## won't remove anything because all lines are unique thanks to the sample ID
  sample value mutation
1      A    10        x
2      B    10        x
3      C   100        y
> row.names(df) <- df$sample
# now, ignore the sample ID column
> unique(df[, c("value","mutation")])
  value mutation
A    10        x
C   100        y

Note how this will leave you completely blind to which samples that mutation was found in!

If you wanted to somewhat keep track of that, this might be a way (sorry, don't know of a data.frame way although I'm sure there is one):

> library(data.table)
> library(magrittr)
> data.table(df) %>% .[, paste(sample, collapse = ","), mutation]
   mutation  V1
1:        x A,B
2:        y   C

(1) Are you sure that these are duplicates you want to remove? Maybe these mutations are simply fairly common and can be found in numerous samples and are totally legit!

ADD REPLY • link 5.2 years ago by Friederike 8.9k

2

Entering edit mode

Friederike is correct that unique(df) will only remove duplicated rows, which may be your problem (as you say), in which case using unique(df) you will have corrected it. The warning by dndscv is slightly confusing since it will also be triggered by duplicated rows (I should make it clearer).

Friederike is also correct that identical mutations, particularly driver mutations (e.g. BRAF V600E), can appear in multiple samples. The warning only serves to remind the user to make sure that every entry in the mutation table is an independent mutation. Two identical mutations in the cancers of two different patients are genuinely independent events and should not be collapsed into one. However, duplicated rows or multiple observations of the same mutation in phylogenetically-related areas of the same tumour should be counted only once. It is up to the user to ensure that all entries in the table are independent.

ADD REPLY • link 5.2 years ago by Inigo Martincorena ▴ 220

0

Entering edit mode

Sorry, you are mentioning what exactly makes me worrying about removing duplicates as I am not sure what I am removing

This is link of my whole data

https://www.dropbox.com/s/nf0kuo8573zhnio/indel%20%281%29.txt?dl=0 I saw after using unique I am removing a lot of data though

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

2

Entering edit mode

I've shown you all the commands to explore this issue...

> library(magrittr)
> library(data.table)
> indels <- fread("indels.txt", fill = TRUE)
> dim(indels)
[1] 24045     5

> dim(unique(indels))
[1] 22549     5 # there are 1496 true duplicates (= identical lines)

>  indels <- unique(indels)
> indels[, paste(sampleID, collapse = ","), by = c("chr","pos","ref","mut")]
       chr       pos ref mut                                     V1
    1:   1     56987  CA   C LP6005334_DNA_H01_vs_LP6005333_DNA_H01
    2:   1   4133415  AG   A LP6005334_DNA_H01_vs_LP6005333_DNA_H01
    3:   1   4611030   A  AC LP6005334_DNA_H01_vs_LP6005333_DNA_H01
    4:   1   4872820   G  GA LP6005334_DNA_H01_vs_LP6005333_DNA_H01
    5:   1   6092644   A  AG LP6005334_DNA_H01_vs_LP6005333_DNA_H01
   ---                                                             
22425:   X 128553587   A ATT LP6008334_DNA_C02_vs_LP6008335_DNA_G01
22426:   X 130518455   A  AT LP6008334_DNA_C02_vs_LP6008335_DNA_G01
22427:   X 132416170  CA   C LP6008334_DNA_C02_vs_LP6008335_DNA_G01
22428:   X 136291349  AT   A LP6008334_DNA_C02_vs_LP6008335_DNA_G01
22429:   X 145516886  AT   A LP6008334_DNA_C02_vs_LP6008335_DNA_G01
# --> there's 22429 left after this unique step ignoring the sample IDs, i.e. about 
# 120 mutations that are found in more than one sample

# to see which mutations are found in more than one sample, you can grep for the comma that
# we add in the paste command
> indels[, paste(sampleID, collapse = ","), by = c("chr","pos","ref","mut")] %>% .[ grepl(",", V1)]
     chr       pos ref                       mut
  1:   1 215653764  AT                         A
  2:   2  40893359  GT                         G
  3:   2  78324046 CAA                         C
  4:   2 119219369  TA                         T
  5:   7  14143149  CA                         C
 ---                                            
114:  13  54027680  GT                         G
115:   X 131481051   A                        AC
116:  11  49527629   T                        TA
117:  13  46417165   C CTGTGTGTGTGTGTGTGTGTGTGTG
118:  19  56581430   C                        CT
                                                                                V1
  1: LP6005334_DNA_H01_vs_LP6005333_DNA_H01,LP6005935_DNA_G02_vs_LP6005979_DNA_C01
  2: LP6005334_DNA_H01_vs_LP6005333_DNA_H01,LP6008334_DNA_D02_vs_LP6008335_DNA_H01
  3: LP6005334_DNA_H01_vs_LP6005333_DNA_H01,LP6005935_DNA_G02_vs_LP6005979_DNA_C01
  4: LP6005334_DNA_H01_vs_LP6005333_DNA_H01,LP6008460_DNA_G03_vs_LP6008340_DNA_C05
  5: LP6005334_DNA_H01_vs_LP6005333_DNA_H01,LP6008336_DNA_H01_vs_LP6008333_DNA_H01
 ---                                                                              
114: LP6008460_DNA_F02_vs_LP6008340_DNA_E05,LP6008334_DNA_C02_vs_LP6008335_DNA_G01
115: LP6008460_DNA_F02_vs_LP6008340_DNA_E05,LP6008334_DNA_C02_vs_LP6008335_DNA_G01
116: LP6008460_DNA_G03_vs_LP6008340_DNA_C05,LP6008334_DNA_C02_vs_LP6008335_DNA_G01
117: LP6008460_DNA_G03_vs_LP6008340_DNA_C05,LP6008334_DNA_C02_vs_LP6008335_DNA_G01
118: LP6008460_DNA_G03_vs_LP6008340_DNA_C05,LP6008334_DNA_C02_vs_LP6008335_DNA_G01

So, now you do know what you're losing. For example, the insertion of CTGTGTGTGTGTGTGTGTGTGTGTG on chr 13 is found in samples LP6008460_DNA_G03_vs_LP6008340_DNA_C05 and LP6008334_DNA_C02_vs_LP6008335_DNA_G01.

The first run of unique should definitely be kept to remove the true duplicated lines (of identical content); whether you want to keep or remove the duplicated mutations is really a decision you have to based on what the final question is.

ADD REPLY • link 5.2 years ago by Friederike 8.9k

0

Entering edit mode

Thank you to bearing with me

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

0

Entering edit mode

sorry @Inigo Martincorena I just faced to a confusion; I have 139 patients for which of them I have separate called SNP and INDEL by Strelka; So that I have 278 files. When I made input dataframe for dndscv I have this

> length(unique(mutations$sampleID))
[1] 278

Does this influence my results? I mean does your algorithm care about the number of samples? If so here I have 278 samples while I only have 139 patients. Can you guide me please? Thanks

ADD REPLY • link 4.4 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Sorry, but I am sure error of dndscv is because of trying to remove duplicate because original file works well, no doubt

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k