Convert two-column txt to multiple column csv
3
3
Entering edit mode
5.4 years ago
misterie ▴ 110

Hi,

I need some advice. I have output file (count file) from VCF format. It looks like this:

Chr 10
protein_coding 447164
pseudogene 87457
Chr 11
protein_coding 368825
pseudogene 78131
Chr 12
protein_coding 357596
pseudogene 68176

and there are more chromosomes. I have two others files with another column names (but could differ with 1 or more fields between chromosomes). How can I convert that file to CSV or another file format. I mean, I want to create file like this:

Chr,protein_coding,pseudogene
10,447164,87457
11,368825,78131
12,357596,68176

Assuming that if some chromosome does not has for example pseudogene, than script will put empty field, e.g. for 15 chromosome:

15,132598,

Thank you in advance

bash python r conversion • 1.6k views
ADD COMMENT
0
Entering edit mode

Thank you for explaining the problem at hand so well.

What have you tried by yourself to solve this problem? How far did you get and what specific challenges are you facing?

ADD REPLY
0
Entering edit mode

I have no idea, how can I do that...

ADD REPLY
1
Entering edit mode

You can use awk with RS=Chr. That will use Chr to create records, so each record would contain all data between consecutive Chrs. You can then replace each new line by a space and use $1, $2 etc to get to your result.

ADD REPLY
0
Entering edit mode

Also, please see the following oddities:

  • protein_coding becomes protein_codin
  • There's a blank space preceding pseudogene. Why?
  • Your example for "15 chromosome" has entry for Chr = 16
  • Your current dataset has nothing separating Chr (the key) from the chromosome number (the value). Is this true?
  • The comma-separated format you've shown is actually semi-colon separated. You do mention "or other", so that's fine I guess, but pick a separator.

These will make a difference in the final script you develop.

ADD REPLY
0
Entering edit mode

sorry, my bad. I have corrected everything.

ADD REPLY
4
Entering edit mode
5.4 years ago
zx8754 11k

Using R dplyr package:

library(dplyr)
library(tidyr) # spread: reshape long-to-wide

# example data (using @RamRS's data)
df1 <- read.table(text = "Chr 10
protein_coding 447164
pseudogene 87457
Chr 11
protein_coding 368825
pseudogene 78131
Chr 12
protein_coding 357596
pseudogene 68176
Chr 13
protein_coding 447164
Chr 14
pseudogene 87457
Chr 15", header = FALSE, stringsAsFactors = FALSE)

write.csv(
  df1 %>% 
    mutate(grp = cumsum(V1 == "Chr")) %>% 
    spread(key = "V1", value = "V2") %>% 
    select(-grp),
  file = "myOutput.csv", na = "", row.names = FALSE, quote = FALSE)

# myOutput.csv
#
# Chr,protein_coding,pseudogene
# 10,447164,87457
# 11,368825,78131
# 12,357596,68176
# 13,447164,
# 14,,87457
# 15,,
ADD COMMENT
0
Entering edit mode

Neat trick, using a conditional cumsum as a group identifier!

ADD REPLY
1
Entering edit mode
5.4 years ago
Ram 43k

Here's a simple bash-awk script I wrote:

$ cat in.txt

Chr 10
protein_coding 447164
pseudogene 87457
Chr 11
protein_coding 368825
pseudogene 78131
Chr 12
protein_coding 357596
pseudogene 68176
Chr 13
protein_coding 447164
Chr 14
pseudogene 87457
Chr 15

$ echo "Chr,protein_coding,pseudogene" && awk 'BEGIN{RS="Chr "} {gsub(/\n/," ",$0); print $0}' in.txt | awk -F" " 'NF==5 {print $1 "," $3 "," $5} NF==3 && $2=="protein_coding" {print $1 "," $3 ","} NF==3 && $2=="pseudogene" {print $1 ",," $3} NF==1 {print $1 ",,"}'

Chr,protein_coding,pseudogene
10,447164,87457
11,368825,78131
12,357596,68176
13,447164,
14,,87457
15,,

The code makes a lot of assumptions and is as unscalable and data-dependent as code gets, but it works. It assumes missing entries would not have corresponding key (first column) entries. It assumes Chr is mandatory. It assumes no blank spaces exist in values, and that the three listed keys are the only allowed keys. I hate this code, but it works.

ADD COMMENT
1
Entering edit mode
5.4 years ago

I think this python script should do the trick, but I haven't tested it

import sys

data = dict(Chr= 'Chr', protein_coding= 'protein_coding', pseudogene = 'pseudogene')
for line in open(sys.argv[1]):
    if line.startswith('Chr'):
        print("{},{},{}".format(*data.values()))
        data = dict(Chr= '', protein_coding= '', pseudogene = '')
    data[line.split('\t')[0]] = line.strip().split('\t')[1]
ADD COMMENT

Login before adding your answer.

Traffic: 1703 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6