Biostars beta testing.
Question: Extract a word from inside the text
0

Dear all,

I have a text file like the one below, and I want to extract the gene names.

For example:

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B

from the input below:

ID=id18056;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XM_006073960.2;gbkey=mRNA;gene=NECTIN3;product=nectin cell adhesion molecule
ID=id18065;Parent=rna1457;Dbxref=GeneID:102398777,Genbank:XR_003108818.1;gbkey=misc_RNA;gene=TAGLN3;product=nectin cell adhesion molecule
ID=cds1149;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XP_006074022.1;Name=XP_006074022.1;gbkey=CDS;gene=SMG6;product=nectin-3;protein
ID=id18057;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XM_006073960.2;gbkey=mRNA;gene=ERICH1;product=nectin cell adhesion molecule
ID=id18066;Parent=rna1457;Dbxref=GeneID:102398777,Genbank:XR_003108818.1;gbkey=misc_RNA;gene=DLGAP2;product=nectin cell adhesion molecule
ID=cds1149;Parent=rna1456;Dbxref=GeneID:102398777,Genbank:XP_006074022.1;Name=XP_006074022.1;gbkey=CDS;gene=PPP2R2B;product=nectin-3;protein

What is the best way to do this?

ADD COMMENTlink 15 months ago mostafarafiepour • 60 • updated 15 months ago cpad0112 11k
3

To extract gene names:

tr ";*" "\n" < input.txt | grep  '^gene='  | cut -d '=' -f2 
NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B

To extract unique gene names

tr ";*" "\n" < input.txt | grep  '^gene='  | cut -d '=' -f2 | sort | uniq
ADD COMMENTlink 15 months ago Pierre Lindenbaum 120k • updated 15 months ago zx8754 7.5k
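As a side note, `sort | uniq` can be shortened to `sort -u`, and `uniq -c` reports how often each gene name occurs. A minimal sketch of the same idea as the pipeline above, assuming a small made-up input in the semicolon-delimited format (the temporary file and its three lines are illustrative only, not the poster's data):

```shell
# Build a tiny illustrative input (hypothetical data in the question's format).
tmp=$(mktemp)
printf '%s\n' \
  'ID=id1;Parent=rna1;gbkey=mRNA;gene=SMG6;product=x' \
  'ID=id2;Parent=rna1;gbkey=CDS;gene=NECTIN3;product=y' \
  'ID=id3;Parent=rna2;gbkey=mRNA;gene=SMG6;product=x' > "$tmp"

# sort -u sorts and de-duplicates in one step.
unique_genes=$(tr ';' '\n' < "$tmp" | grep '^gene=' | cut -d '=' -f2 | sort -u)
echo "$unique_genes"

# Count occurrences per gene; awk normalises the padded uniq -c output
# to "NAME COUNT".
gene_counts=$(tr ';' '\n' < "$tmp" | grep '^gene=' | cut -d '=' -f2 \
  | sort | uniq -c | awk '{print $2, $1}')
echo "$gene_counts"

rm -f "$tmp"
```

The counts can be handy for checking how many features share a gene before the duplicates are discarded.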
0

Many thanks for all the answers.

All of them were great.

ADD REPLYlink 15 months ago
mostafarafiepour
• 60
0

I have now extracted the gene names, but there is a problem: a gene may occur at several positions, so its name appears several times in the output.

Any suggestions?

ADD REPLYlink 15 months ago
mostafarafiepour
• 60
0
sort | uniq

.

ADD REPLYlink 15 months ago
Pierre Lindenbaum
120k
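The reason for the `sort` step is that `uniq` only collapses adjacent duplicate lines. A small sketch of the difference, using a made-up list of three gene names (the list is illustrative only):

```shell
# Hypothetical list of gene names with a non-adjacent duplicate.
genes=$(printf '%s\n' SMG6 NECTIN3 SMG6)

# uniq alone only collapses adjacent repeats, so SMG6 survives twice.
unsorted=$(printf '%s\n' "$genes" | uniq)

# Sorting first puts identical names next to each other.
sorted=$(printf '%s\n' "$genes" | sort | uniq)

echo "without sort:"; echo "$unsorted"
echo "with sort:";    echo "$sorted"
```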
0

Sorry, how do I use sort | uniq?

Do you mean to add it to the previous script?

tr ";*" "\n" < input.txt | grep  '^gene='  | cut -d '=' -f2 | sort | uniq
ADD REPLYlink 15 months ago
mostafarafiepour
• 60
4

mostafarafiepour, with all due respect: Invest time and search for these absolutely basic answers yourself. This is a bioinformatics Q&A community, intended to help with bioinformatics-related problems, not a basic Unix learning platform. You are lucky people actually answer these kinds of questions. Again, with respect, but if you are already stuck with these most simple things, I am worried that you will run into some severe trouble once analysis gets beyond executing basic Unix scripts. Learn the basics first, plenty of open-source material online on this.

ADD REPLYlink 15 months ago
ATpoint
17k
1

Hi

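Use this code:

# suppose df is your table (gsub works on a character vector, e.g. from readLines())
df <- gsub(".*gene=", "", df)   # drop everything up to and including "gene="
df <- gsub("[*].*", "", df)     # drop everything from the first "*" onwards

Or split each line on the ** delimiter.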

ADD COMMENTlink 15 months ago ahmad mousavi • 430
0

I modified the text file. The gene name is no longer surrounded by **.

ADD REPLYlink 15 months ago
mostafarafiepour
• 60
1

Super fast and easy with grep and a regular expression:

grep -P '(?<=\*\*gene=)\w+(?=\*\*)' -o gene.txt

where gene.txt is your file name

Output

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B

Explanation

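-P Use Perl-compatible regular expressions (PCRE), which are needed for lookarounds

(?<=...) Positive lookbehind (left anchor): the text before the match must be **gene=

(?=...) Positive lookahead (right anchor): the text after the match must be **

-o Output only what matched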

ADD COMMENTlink 15 months ago Vijay Lakhujani 4.1k
0

Excuse me, what is the input file? You only specify the output.

ADD REPLYlink 15 months ago
mostafarafiepour
• 60
0

gene.txt is the input file. The output is written to standard output (stdout).

ADD REPLYlink 15 months ago
Vijay Lakhujani
4.1k
0

Why do you have the ** in the positive lookbehind assertion? And why do you need the positive lookahead assertion? grep -oP "(?<=gene=)[^;]+" will suffice, no?

EDIT: cpad is correct, we don't even need the [^;]; this will suffice: grep -oP '(?<=;gene=)\w+'

EDIT2: Turns out OP changed the data after posting a snippet with the **.

ADD REPLYlink 15 months ago
RamRS
21k
0

I think your script should be changed like this:

grep -P '(?<=\;gene=)\w+(?=\;)' -o gene.txt
ADD REPLYlink 15 months ago
mostafarafiepour
• 60
0

I get the single quotes and the inclusion of a semi-colon to account for other attributes that may end in gene=, but why include a positive lookahead for a semi-colon?

Also, grep -oP <pattern> <file> is equivalent to grep -P <pattern> -o <file>, as neither -o nor -P is a positional argument.

ADD REPLYlink 15 months ago
RamRS
21k
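The point about the leading semicolon can be illustrated with a short sketch. It assumes GNU grep built with PCRE support (`-P`), and the input line, including its `pseudogene=` attribute, is made up for the demonstration:

```shell
# Hypothetical line with a second attribute whose key also ends in "gene=".
line='ID=id1;pseudogene=FOO1;gbkey=mRNA;gene=SMG6;product=x'

# Without the leading semicolon, both values are reported.
loose=$(printf '%s\n' "$line" | grep -oP '(?<=gene=)\w+')

# Requiring ";gene=" restricts the match to the attribute named exactly "gene".
strict=$(printf '%s\n' "$line" | grep -oP '(?<=;gene=)\w+')

echo "loose:"; echo "$loose"
echo "strict:"; echo "$strict"
```

One caveat of the stricter pattern: it would miss a `gene=` attribute that happens to be the very first field on a line, since no semicolon precedes it.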
0

Or this: grep -oP "(?<=gene=)\w+" test.txt?

ADD REPLYlink 15 months ago
cpad0112
11k
0

Or yes, this. I'd forgotten that \w does not match ;. Thanks, cpad!

ADD REPLYlink 15 months ago
RamRS
21k
0
$ sed 's/.*gene=\(\w\+\);.*/\1/g' test.txt 

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B
ADD COMMENTlink 15 months ago cpad0112 11k
1

with awk:

$ awk -F'gene=|;prod' '{print $2}' test.txt

or

$ awk 'gsub(/.*gene=|;product.*/,"")' test.txt 

NECTIN3
TAGLN3
SMG6
ERICH1
DLGAP2
PPP2R2B
ADD REPLYlink 15 months ago
cpad0112
11k
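Both awk one-liners above rely on `;product` following the gene name. A possible variant that does not depend on attribute order is to split on `;` and look for the field whose key is `gene`. This is only a sketch; the temporary file and its two lines are hypothetical:

```shell
# Hypothetical input: in the second line, gene= is not followed by ;product.
tmp=$(mktemp)
printf '%s\n' \
  'ID=id1;gbkey=mRNA;gene=NECTIN3;product=nectin cell adhesion molecule' \
  'ID=id2;gene=SMG6;gbkey=CDS' > "$tmp"

# Split each line on ";" and print the value of the field that starts with "gene=".
awk_genes=$(awk -F';' '{
  for (i = 1; i <= NF; i++)
    if ($i ~ /^gene=/) print substr($i, 6)   # 6 = length("gene=") + 1
}' "$tmp")
echo "$awk_genes"

rm -f "$tmp"
```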
0

sed -r is your friend :-)

sed -r 's/.*gene=(\w+);.*/\1/g' test.txt

Although you may wish to add a ; before gene and omit the .* after the second ; :-)

ADD REPLYlink 15 months ago
RamRS
21k
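One thing to be aware of with the sed approach: a plain `s///` prints lines without a gene attribute unchanged. A hedged sketch that prints only the lines where the substitution succeeded, using `-n` together with the `p` flag (`-E` is the spelling of `-r` that both GNU and BSD sed accept; the input lines are made up):

```shell
# Hypothetical input: the second line has no gene= attribute at all.
tmp=$(mktemp)
printf '%s\n' \
  'ID=id1;gbkey=mRNA;gene=ERICH1;product=x' \
  'ID=id2;gbkey=mRNA;product=no gene attribute here' > "$tmp"

# -n suppresses the default output; the p flag prints only lines where
# the substitution actually happened. [^;]+ stops at the next semicolon.
sed_genes=$(sed -E -n 's/.*;gene=([^;]+).*/\1/p' "$tmp")
echo "$sed_genes"

rm -f "$tmp"
```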
1

I think you have posted several times about the -r option and I keep forgetting to use it. Thanks, RamRS.

ADD REPLYlink 15 months ago
cpad0112
11k
0

Not several, maybe just once more. Once you go -r, you never go back. It's like grep -E. So handy and convenient, makes you wonder why plain grep even exists :-)

ADD REPLYlink 15 months ago
RamRS
21k
