Biostar Beta. Not for public use.
Question: Extract Information from fasta File
0
Entering edit mode

I have big fasta file with information like this,

>ENST00000448914.1 cdna chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]
ACTGGGGGATACG
>ENST00000631435.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
>ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC

I want to retrieve the transcript ID that is the first entry and then i want to extract the gene Id (fourth entry) from the file to a new file. I know it works something like this

zcat file.fa.gz | grep ">" | perl -lane 'if***************{print join("\t", $1, $4)}' > transcripts2genes.

What I dont know is what come on the part of the asterisk. Can somebody help me with this?

i want the output be like

ENST00000448914.1  ENSG00000228985.1
ENST00000631435.1  ENSG00000282253.1
ADD COMMENTlink 3.3 years ago saamar.rajput • 20 • updated 3.3 years ago Charles Plessy ♦ 2.7k
Entering edit mode
0

Can you provide an example of the desired output?

ADD REPLYlink 3.3 years ago
pld
4.8k
4
Entering edit mode

Here's one way to do this:

zcat myfile.fa.gz | grep ">" | cut -f 1,4 -d" " | tr -d '>gene:' | tr ' ' '\t'

Input

>ENST00000448914.1 cdna chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]
ACTGGGGGATACG
>ENST00000631435.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
>ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC

Output

ENST00000448914.1       ENSG00000228985.1
ENST00000631435.1       ENSG00000282253.1
ENST00000632684.1       ENSG00000282431.1

EDIT: Forgot that last call to tr in the original post

ADD COMMENTlink 3.3 years ago pld 4.8k
Entering edit mode
0

Shouldn't the tr not simply be 'gene:' instead of '>gene:'?

ADD REPLYlink 3.3 years ago
WouterDeCoster
39k
Entering edit mode
1

Nope, here's the output of cut:

>ENST00000448914.1 gene:ENSG00000228985.1
>ENST00000631435.1 gene:ENSG00000282253.1
>ENST00000632684.1 gene:ENSG00000282431.1
ADD REPLYlink 3.3 years ago
pld
4.8k
Entering edit mode
0

Oh yes, you're completely right. I had a not-so-bright moment apparently :-D

ADD REPLYlink 3.3 years ago
WouterDeCoster
39k
2
Entering edit mode

Can you check if this is the desired output?

I'm not sure if you want to keep the sequence information...

paste <(zcat file.fa.gz | cut -f1 -d' ') <(zcat file.fa.gz | cut -f4 -d' ' | sed 's/gene://') > output.fa

Note that you can pipe it into gzip again to get a .gz fasta file again if you would want to.

ADD COMMENTlink 3.3 years ago WouterDeCoster 39k
Entering edit mode
0

this is the output

ENST00000448914.1 ENSG00000228985.1 ACTGGGGGATACG ACTGGGGGATACG ENST00000631435.1 ENSG00000282253.1 GGGACAGGGGGC GGGACAGGGGGC ENST00000632684.1 ENSG00000282431.1 GGGACAGGGGGC GGGACAGGGGGC i do not need the sequence after the gene information but it shows in the output file.

ADD REPLYlink 3.3 years ago
saamar.rajput
• 20
Entering edit mode
0

In that case the answer from joe.cornish826 is most optimal :-)

ADD REPLYlink 3.3 years ago
WouterDeCoster
39k
2
Entering edit mode

Probably

grep '^>' infile | sed -e 's/^>//' -e 's/ .*gene://' | sed 's/ gene_biotype.*//'
ADD COMMENTlink 3.3 years ago Rohit ♦ 1.3k
2
Entering edit mode

In Perl:

zcat file.fa.gz | perl -anE 'next unless />/ ; $F[0] =~ s/>// ; $F[3] =~ s/gene:// ; say join "\t", $F[0], $F[3]'
ADD COMMENTlink 3.3 years ago Charles Plessy ♦ 2.7k
Entering edit mode
1

a more compact perl alternative:

zcat file.fa.gz | perl -ne '/^>(\S+).+ gene:(\S+)/ and print "$1\t$2\n"'
ADD REPLYlink 3.3 years ago
Jorge Amigo
11k
Entering edit mode
0

Slightly more compact: zcat file.fa.gz | perl -nE '/^>(\S+).+ gene:(\S+)/ and say "$1\t$2"' :)

ADD REPLYlink 3.3 years ago
Charles Plessy
♦ 2.7k

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0