Question: Splitting string in a column using character

I'm trying to split the values in a column into two parts, using a specific string as the separator, but I haven't been able to. Most examples available online split on a single-character delimiter, whereas I want a short two-letter string to act as the split point. Is awk or sed recommended for this? Example:

Col1
BOT-rs10136766
BOT-rs104894363
BOT-rs10774624
BOT-rs111647200
GSA-rs117306900
GSA-rs117306950
GSA-rs117306954
GSA-rs117306975
GSA-rs117306989
BOT-seq-rs532891158.1
BOT-seq-rs794728599
DUP-rs121913344
DUP-rs12979860
DUP-seq-rs397518008
DUP-seq-rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
seq-rs794727444.1
seq-rs794727773.1
seq-rs794728252.1
seq-rs794728252.2

Here, I want to extract only the rsID (rs followed by a numeric ID), separated from the prefixes.

11 months ago by sofie_carolina • 20 • updated 11 months ago by zx8754 7.5k
sed 's/.*\(rs\w\+\).*/\1/g' test.txt
Col1
rs10136766
rs104894363
rs10774624
rs111647200
rs117306900
rs117306950
rs117306954
rs117306975
rs117306989
rs532891158
rs794728599
rs121913344
rs12979860
rs397518008
rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
rs794727444
rs794727773
rs794728252
rs794728252
11 months ago by cpad0112 11k

Maybe move to answer?

11 months ago by zx8754 7.5k

Guessing from the .1, .2 suffixes: is this output from an R script?

11 months ago by zx8754 7.5k
grep -oP 'rs\d+(\.\d+)?' test.txt

where test.txt is the file containing the IDs you listed above.

Output:

rs10136766
rs104894363
rs10774624
rs111647200
rs117306900
rs117306950
rs117306954
rs117306975
rs117306989
rs532891158.1
rs794728599
rs121913344
rs12979860
rs397518008
rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
rs794727444.1
rs794727773.1
rs794728252.1
rs794728252.2
11 months ago by Vijay Lakhujani 4.1k

How do I specify the column here, for example column 1? Also, I don't need the integers after the decimal point: some rsIDs have .1, .2, and so on, and I don't need those. Can both of these be handled in your script?

11 months ago by sofie_carolina • 20

Can you paste an example of what your result should look like?

11 months ago by Vijay Lakhujani 4.1k

I think they just want rsXXX: drop the prefix (everything up to and including the dash) and the suffix (everything from the dot onward).

11 months ago by zx8754 7.5k
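
The rule described in the comment above (remove everything up to and including the last dash, then everything from the first remaining dot) can be sketched with two plain sed substitutions. This is a minimal sketch and not a command from the thread: the function name strip_affixes is hypothetical, and it assumes one ID per line, as in the question's example, with no dash or dot anywhere else on the line. It does not check that what remains actually starts with rs.

```shell
# strip_affixes [FILE...]: hypothetical helper, not from the thread.
# Assumes one ID per line; reads standard input when no file is given.
strip_affixes() {
    # 1st expression: greedy ".*-" removes everything through the LAST dash.
    # 2nd expression: removes the first remaining dot and all that follows.
    sed -e 's/^.*-//' -e 's/\..*$//' "$@"
}
```

Both expressions use only basic regular-expression syntax, so this should not depend on GNU extensions such as \w or \+. Lines without a dash or dot, such as the Col1 header, pass through unchanged.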
$ grep -Po '(?<=^|-)rs\w*' test.txt
rs10136766
rs104894363
rs10774624
rs111647200
rs117306900
rs117306950
rs117306954
rs117306975
rs117306989
rs532891158
rs794728599
rs121913344
rs12979860
rs397518008
rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
rs794727444
rs794727773
rs794728252
rs794728252
11 months ago by cpad0112 11k

Try this:

grep -oP 'rs\d+' test.txt
11 months ago by Vijay Lakhujani 4.1k

I have the rsIDs in column 2. Where do I specify the column in this script?

11 months ago by sofie_carolina • 20

I don't want to write the rsIDs to another file. I want the output printed in the same column. Rows where no rs is found should be omitted.

11 months ago by sofie_carolina • 20
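
For the multi-column case raised in the last two comments, an awk sketch along the following lines may work. It is not from the thread: rsid_rows is a hypothetical name, and the sketch assumes tab-separated fields and a header on the first line. It rewrites the chosen column in place with the bare rsID (rs followed by digits, so suffixes such as .1 are dropped) and skips data rows where that column contains no rsID.

```shell
# rsid_rows COL [FILE...]: hypothetical helper, not from the thread.
# Assumes tab-separated input whose first line is a header.
rsid_rows() {
    col=$1
    shift
    awk -v col="$col" 'BEGIN { FS = OFS = "\t" }
    NR == 1 { print; next }              # keep the header line as it is
    # match() sets RSTART and RLENGTH to the first "rs<digits>" in the field;
    # rows without a match fall through and are not printed.
    match($col, /rs[0-9]+/) {
        $col = substr($col, RSTART, RLENGTH)
        print
    }' "$@"
}
```

Calling rsid_rows 2 on a table prints it to standard output with column 2 rewritten and the other columns untouched. To update the file itself, one option is to redirect the output to a temporary file and then rename that over the original.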
