Question

File processing, arranging columns


6.3 years ago

AP ▴ 80

Hello everyone,

I have a tab-delimited file, file1, in the following format:

g1  pfam    PF12    ABC transporter
g2  pfam    PF13    transcription factor
g3  pfam    PF14    glycosyl hydrolase
pfam    PF15    FAD binding domain  
g4  pfam    PF16    RTA1 like protein
pfam    PF17    Zinc-binding dehydrogenase  
pfam    PF18    major facilitator superfamily   
g5  pfam    PF19    short chain dehydrogenase
g6  pfam    PF20    ABC transporter

I want to rearrange this file so that each line beginning with pfam gets the gene ID (g3, g4, etc.) from the nearest preceding line that has one. The output file should also be tab-delimited and look like this:

g1  pfam    PF12    ABC transporter
g2  pfam    PF13    transcription factor
g3  pfam    PF14    glycosyl hydrolase
g3  pfam    PF15    FAD binding domain
g4  pfam    PF16    RTA1 like protein
g4  pfam    PF17    Zinc-binding dehydrogenase
g4  pfam    PF18    major facilitator superfamily
g5  pfam    PF19    short chain dehydrogenase
g6  pfam    PF20    ABC transporter

Many thanks in advance.

Ambika

awk linux bash • 1.7k views

updated 6.3 years ago by cpad0112 21k • written 6.3 years ago by AP ▴ 80

score 1 · Answer 1 · 2017-12-21


6.3 years ago

5heikki 11k

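Assuming the gene ID column is missing entirely on the continuation lines (there is no empty first field, so those lines have three fields instead of four):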

awk 'BEGIN{OFS=FS="\t";first=""}{if(NF==4){first=$1;print $0}else{print first,$0}}' input > output

If the first field is present but empty (the line starts with a tab):

awk 'BEGIN{OFS=FS="\t";first=""}{if($1!=""){first=$1; print $0}else{print first,$2,$3,$4}}' input > output
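
For readers who find the one-liners dense, the same fill-down logic can be sketched in Python. This is only an illustration of the two cases above, not a replacement for the awk commands; the function name fill_gene_ids and the sample rows are made up for the example, and it assumes the rows have no trailing tabs.

```python
def fill_gene_ids(lines):
    """Copy the last seen gene id onto rows that lack one."""
    last_id = ""  # plays the role of awk's `first`; empty until a gene id is seen
    out = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if fields[0] == "":
            # Case 2: the first field exists but is empty
            fields[0] = last_id
        elif len(fields) < 4:
            # Case 1: the gene id column is missing entirely
            fields.insert(0, last_id)
        else:
            # Complete row: remember its gene id for the rows that follow
            last_id = fields[0]
        out.append("\t".join(fields))
    return out


# Hypothetical sample mixing both layouts
sample = [
    "g3\tpfam\tPF14\tglycosyl hydrolase",
    "pfam\tPF15\tFAD binding domain",
    "g4\tpfam\tPF16\tRTA1 like protein",
    "\tpfam\tPF17\tZinc-binding dehydrogenase",
]
for row in fill_gene_ids(sample):
    print(row)
```

Like the first awk command, the field-count test only works if every complete row has exactly four tab-separated fields.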



Nice one-liner. One question: do you really need first="" in the BEGIN statement?

6.3 years ago by e.rempel ★ 1.1k


Thank you. I will try that.

6.3 years ago by AP ▴ 80

score 1 · Answer 2 · 2017-12-21

output using sed and awk:

$ sed 's/^pfam/\tpfam/g' test.txt |awk -F "\t" -v OFS="\t" '{if($1=="") {$1=previous} previous=$1}1'  
g1  pfam    PF12    ABC transporter
g2  pfam    PF13    transcription factor
g3  pfam    PF14    glycosyl hydrolase
g3  pfam    PF15    FAD binding domain  
g4  pfam    PF16    RTA1 like protein
g4  pfam    PF17    Zinc-binding dehydrogenase  
g4  pfam    PF18    major facilitator superfamily   
g5  pfam    PF19    short chain dehydrogenase
g6  pfam    PF20    ABC transporter

input (tab separated):

$ cat test.txt 
g1  pfam    PF12    ABC transporter
g2  pfam    PF13    transcription factor
g3  pfam    PF14    glycosyl hydrolase
pfam    PF15    FAD binding domain  
g4  pfam    PF16    RTA1 like protein
pfam    PF17    Zinc-binding dehydrogenase  
pfam    PF18    major facilitator superfamily   
g5  pfam    PF19    short chain dehydrogenase
g6  pfam    PF20    ABC transporter
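
The two-step idea in this answer (first normalize the rows, then fill the empty first column) can also be written out in Python as a rough sketch. The function name normalize_then_fill and the sample rows are invented for illustration. Since ^ can match only once per line, the g flag in the sed command is not needed, and the sketch performs a single substitution per row.

```python
import re


def normalize_then_fill(lines):
    """Yield rows with the gene id carried down from the previous row."""
    previous = ""
    for line in lines:
        # Step 1 (the sed part): give pfam-leading rows an empty first field
        line = re.sub(r"^pfam", "\tpfam", line.rstrip("\r\n"))
        fields = line.split("\t")
        # Step 2 (the awk part): fill an empty first field, then remember it
        if fields[0] == "":
            fields[0] = previous
        previous = fields[0]
        yield "\t".join(fields)


# Hypothetical rows in the same layout as test.txt
rows = list(normalize_then_fill([
    "g4\tpfam\tPF16\tRTA1 like protein",
    "pfam\tPF17\tZinc-binding dehydrogenase",
    "pfam\tPF18\tmajor facilitator superfamily",
]))
for row in rows:
    print(row)
```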

score 0 · Answer 3 · 2017-12-20

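This should work in Python, as long as you replace "input.txt" and "output.txt" with the paths to your input file and desired output file. It splits on any whitespace, so runs of spaces inside the description are collapsed to a single space.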

prev_g = ""  # last gene id seen; stays empty if the file starts with a pfam line

# "with" closes both files even if an error occurs part-way through
with open("input.txt") as f, open("output.txt", "w") as o:
    for line in f:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "pfam":
            # Continuation line: reuse the gene id from the previous record
            parts.insert(0, prev_g)
        else:
            prev_g = parts[0]
        # The first three columns are single tokens; the rest is the description
        row = parts[:3] + [" ".join(parts[3:])]
        # Join with tabs so no trailing tab is written
        o.write("\t".join(row) + "\n")