Remove rows based on duplicates in one column
4
0
Entering edit mode
8.1 years ago
thomasterTW ▴ 20

I'm doing RNAseq analysis to determine differential expression patterns. After running a custom R script to determine the cut off between expressed and unexpressed genes (kernel density plot), I get a list with 4 columns.

Ensembl ID          Short gene name     Locus                       10Log FPKM
ENSG00000201699     RNU1-59P            chr1:144534037-144534199    1.587315
ENSG00000215861     WI2-1896O14.1       chr1:144676873-144679969    0.975001
ENSG00000254539     RP4-791M13.3        chr1:144833167-144835867    -0.38691
ENSG00000225241     RP11-640M9.2        chr1:144593362-144621656    0.982576
ENSG00000203843     PFN1P2              chr1:144612265-144612683    -0.55001
ENSG00000235398     LINC00623           chr1:144275917-144311653    -0.44573 <-- dup
ENSG00000235398     LINC00623           chr1:144299757-144341756    0.524043 <-- dup
ENSG00000207106     RNVU1-4             chr1:144311212-144311376    0.949058
ENSG00000231360     AL592284.1          chr1:144339737-144521058    -0.37044
ENSG00000236943     RP11-640M9.1        chr1:144456137-144521970    0.434007

These lists averagely contain 25k genes. As you can see, there's a duplicate in this list (LINC00623). What I want to do is write a script (preferably perl or R, or maybe something like awk in the command line) to find these duplicates based on the Ensembl ID, and remove the line with the lowest FPKM (since it is the same gene averagely on the same locus but apparently cufflinks decides that it is expressed twice). I can write a script to determine this for my example, but the problem is that there can be genes in between the duplicates, so I need to find duplicates in the entire column. I haven't been able to figure this out so I really hope someone can help me.

Thanks in advance!

RNA-Seq duplicates • 2.5k views
ADD COMMENT
1
Entering edit mode

I would first identify why cufflinks decided that it is expressed twice...

If you do in the terminal:

cut -f 2 file.txt | uniq -d > duplicates

Then you know which "short gene names" are duplicates and you could identify a pattern or something and the number of them.

ADD REPLY
0
Entering edit mode

I'd agree with this, I guess it's something to do with the XLOC codes produced and how your script handles those. I'd also highly suggest that you try the htseq_count -> DESeq2 method for differential gene expression, if anything just to compare results.

ADD REPLY
2
Entering edit mode
8.1 years ago
5heikki 11k

In shell:

export LC_ALL=C; sort -k1,1 -k4,4gr file | sort -mu -k1,1 > out

or with proper GNU sort

export LC_ALL=C; sort --parallel $NUMCORES(max 8) -k1,1 -k4,4gr file | sort -mu -k1,1 > out
ADD COMMENT
0
Entering edit mode

Thanks! This works very well

ADD REPLY
0
Entering edit mode
8.1 years ago
george.ry ★ 1.2k

Sort the table by the last column:

df = df[order(-df$Log10_FPKM),]

Then remove duplicates based on first column:

df[!duplicated(df$Ensembl_ID),]
ADD COMMENT
0
Entering edit mode
8.1 years ago
Charles Plessy ★ 2.9k

Not in R, but have a look at bedtools groupby.

ADD COMMENT
0
Entering edit mode
8.1 years ago
Veera ▴ 90

In R you can do by splitting the data frame using ensemblID column and extracting the rows with maximum FPKM and combining them in to a data frame.

dfm <- read.table(..)

dfm.lst <- split(dfm, dfm$EnsemblID)
dfm.lst.dupRemoved <- sapply(dfm.split, function(x) x[x$10LogFPKM == max(x$10LogFPKM), ])
dfm.dupRemoved <- do.call("rbind",df.lst.dupRemoved)

write.table(dfm.dupRemoved, "youNameIt", row.names = FALSE, quote = FALSE, sep = "\t")
ADD COMMENT

Login before adding your answer.

Traffic: 2385 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6