Question

How can I get only the rows when for the converge points : - +

0

Entering edit mode

5.6 years ago

Lila M ★ 1.2k

Hi everybody, Maybe this is more a programming question related, but any advice could be very handy. I'm trying to select the rows when the 4th has convergence points ( - following by +) in this data sets:

chr1    1275000 1284999 +
chr1    1285000 1294999 -
chr1    1295000 1304999 -
chr1    1385000 1394999 -
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1435000 1444999 +
chr1    1715000 1724999 +
chr1    1725000 1734999 -
chr1    1735000 1744999 -
chr1    1745000 1754999 -
chr1    1795000 1804999 -
chr1    1805000 1814999 +
chr1    1815000 1824999 -
chr1    1865000 1874999 -

Expected:

 chr1   1415000 1424999 -
 chr1   1425000 1434999 +
 chr1   1795000 1804999 -
 chr1   1805000 1814999 +

I would like to use R, but I have no idea how to start with it. Any suggestion? Thanks!

R convergence points • 1.6k views

ADD COMMENT • link updated 5.6 years ago by Alex Reynolds 35k • written 5.6 years ago by Lila M ★ 1.2k

2

Entering edit mode

5.6 years ago

Santosh Anand 5.7k

The lag and lead functions of dplyr package allows you to fetch previous or next values from a column, respectively. Here is the complete code

library(dplyr)

## Generate a random df 
set.seed(33)
df = data.frame(x=LETTERS[1:10], y=rnorm(10), z=sample(c("+", "-"), 10, replace = T))

## concatenate previous value (lag), current value, and next value (lead) of z
det = with(df, paste0(lag(z), z, lead(z)))

## check how the df looks with det
cbind(df, det)

> cbind(df, det)
   x           y z  det
1  A -0.13592452 - NA-+
2  B -0.04079697 +  -++
3  C  1.01053901 +  +++
4  D -0.15826244 +  ++-
5  E -2.15663750 -  +--
6  F  0.49864683 -  ---
7  G -0.75524431 -  --+
8  H  0.77858519 +  -+-
9  I  0.75457795 -  +-+
10 J -1.09954561 + -+NA


## you need to get all rows with  "-+" in det
df[grepl("-+", det, fixed=T), ]

   x           y z
1  A -0.13592452 -
2  B -0.04079697 +
7  G -0.75524431 -
8  H  0.77858519 +
9  I  0.75457795 -
10 J -1.09954561 +

ADD COMMENT • link 5.6 years ago by Santosh Anand 5.7k

2

Entering edit mode

5.6 years ago

cpad0112 21k

$  grep  -A 1  "-" test.txt | grep --no-group-separator -B 1 "+"
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1795000 1804999 -
chr1    1805000 1814999 +

$ cat test.txt 
chr1    1275000 1284999 +
chr1    1285000 1294999 -
chr1    1295000 1304999 -
chr1    1385000 1394999 -
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1435000 1444999 +
chr1    1715000 1724999 +
chr1    1725000 1734999 -
chr1    1735000 1744999 -
chr1    1745000 1754999 -
chr1    1795000 1804999 -
chr1    1805000 1814999 +
chr1    1815000 1824999 -
chr1    1865000 1874999 -

ADD COMMENT • link 5.6 years ago by cpad0112 21k

1

Entering edit mode

in R with dataframe:

> df1=read.csv("test.txt", sep="\t", stringsAsFactors = F, strip.white = T, header = F)
> df2=df1[grep("\\+", df1$V4,value = F)-1,]
> df3=df2[grep("\\-", df2$V4,value = F),]
> df1[sort(c(as.integer(row.names(df3)),as.integer(row.names(df3))+1)),]
     V1      V2      V3 V4
5  chr1 1415000 1424999  -
6  chr1 1425000 1434999  +
12 chr1 1795000 1804999  -
13 chr1 1805000 1814999  +

ADD REPLY • link 5.6 years ago by cpad0112 21k

1

Entering edit mode

Much simpler with pcregrep (available in most of the linux repos):

$ pcregrep -M  '\-$\n.*\+$' test1.txt
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1795000 1804999 -
chr1    1805000 1814999 +

on *buntu, sudo apt install pcregrep

ADD REPLY • link 5.6 years ago by cpad0112 21k

1

Entering edit mode

with awk:

$ awk '/+/ {if(s == "-") {print $0}} {s = $4 }' test1.txt | grep --no-group-separator -f - -B 1 test1.txt 
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1795000 1804999 -
chr1    1805000 1814999 +

ADD REPLY • link 5.6 years ago by cpad0112 21k

1

Entering edit mode

+1 for a neat and elegant solution. Probably you could add the explanation of various flags for a novice user.

ADD REPLY • link 5.6 years ago by Santosh Anand 5.7k

score 3 · Accepted Answer · 2018-08-30

3

Entering edit mode

5.6 years ago

biocyberman ▴ 860

Haven't got an R solution yet, but these kind of manipulation I usually do with gawk/awk. Suppose your data file is in test.tsv. This is how you do it:

gawk 'BEGIN{pr=""; ps=""}(ps == "-" && $4 == "+" ) {print pr; print; pr =""; ps =""; next}{ pr = $0; ps=$4}' test.tsv    
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1795000 1804999 -
chr1    1805000 1814999 +

ADD COMMENT • link 5.6 years ago by biocyberman ▴ 860

1

Entering edit mode

It works as expected, thank you very much. awk is very very helpful to solve this kinds of problems, but for me takes a while to learn it...

ADD REPLY • link 5.6 years ago by Lila M ★ 1.2k

1

Entering edit mode

@Lila M: Make sure to check out the updated answer. I forgot to reset pr and ps (previous row, and previous sign) after printing. The updated version works correctly now.

ADD REPLY • link 5.6 years ago by biocyberman ▴ 860

0

Entering edit mode

thank you for the update!

ADD REPLY • link 5.6 years ago by Lila M ★ 1.2k

score 3 · Accepted Answer · 2018-08-30

3

Entering edit mode

5.6 years ago

Alex Reynolds 35k

You could just use awk:

$ awk -vFS="\t" -vOFS="\t" '{ if ($4 == "-") { curr = $0; flag = 1; } else if (($4 == "+") && (flag == 1)) { print curr; print $0; flag = 0; }  }' foo.bed
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1795000 1804999 -
chr1    1805000 1814999 +

If you want to use R, the same logic shown above would work by iterating/looping over rows in a data frame, testing $V4 at each row, and setting curr and flag variables accordingly.

ADD COMMENT • link 5.6 years ago by Alex Reynolds 35k

0

Entering edit mode

Do you know any tutorial or useful documentation to learn awk? I know that should be a separate question but I hope could be permitted

ADD REPLY • link 5.6 years ago by Lila M ★ 1.2k

1

Entering edit mode

You can find lot's of examples and other tutorial's on Grymoire.

ADD REPLY • link 5.6 years ago by harish ▴ 450

0

Entering edit mode

Thank you! I've never heard about this toturial before.

ADD REPLY • link 5.6 years ago by Lila M ★ 1.2k

score 3 · Accepted Answer · 2018-08-30

Here's one way to do this with R, which creates a temporary fifth column from shifting the fourth column by a row. The fourth and fifth columns can then be compared directly and used to build a new data frame:

> d <- read.table("foo.bed", header=F)
> r <- as.numeric(row.names(d))
> d$V5 <- d[r+1, ]$V4
> r2 <- as.numeric(row.names(d[d$V4 %in% "-" & d$V5 %in% "+", ]))
> a <- rbind(d[r2, ], d[r2+1, ])
> a2 <- a[order(as.numeric(row.names(a))), 1:4]
> print(a2)
     V1      V2      V3 V4
5  chr1 1415000 1424999  -
6  chr1 1425000 1434999  +
12 chr1 1795000 1804999  -
13 chr1 1805000 1814999  +