Biostar Beta. Not for public use.
Question: Creating Count Matrix

I have to create a count table with:

Gene ID   Sample1  Sample2  Sample3
ENSG...   297      0        0

So far I have one file per sample, each containing a table with 4 columns: the gene ID, followed by three columns of read counts. I want to take the second column from each sample's file and build a count matrix in which each column is labelled with its sample name. How can I make sure each column corresponds to its respective sample?

Thanks

14 months ago lamia_203 • 30 • updated 14 months ago ahaswer • 150

What language or method would you like to use? This can be done using shell, R, Python, Perl, etc. What are you familiar with?

14 months ago seidel • 6.8k

Sorry I didn't specify. I'm familiar with R and Linux but prefer to make the count table in Linux since there are a lot of samples (about 3200).

14 months ago lamia_203 • 30

If you are using Linux, you can also use paste and awk in the terminal:

paste sample1 sample2 sample3 | awk 'BEGIN {OFS="\t"; FS="\t"} {print $1, $2, $4, $6}' > count_matrix

It concatenates the specified tables horizontally and extracts the specified columns. It works for tab-delimited files. If the delimiter is different, set it in "OFS" and "FS".
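Which fields to print depends on how many columns each input file has. For the four-column files described in the question, the second column of each sample sits at fields 2, 6 and 10 of the pasted rows, not 2, 4 and 6 (which would fit two-column files). The toy example below is a minimal sketch under that assumption; the file names and values are invented for illustration, and it assumes every file lists the same genes in the same order.

```shell
# Work in a scratch directory so the toy files do not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical inputs: gene ID followed by three count columns per file.
printf 'geneA\t1\t2\t3\ngeneB\t4\t5\t6\n' > sample1
printf 'geneA\t10\t20\t30\ngeneB\t40\t50\t60\n' > sample2
printf 'geneA\t100\t200\t300\ngeneB\t400\t500\t600\n' > sample3

# paste joins the files line by line, giving 12 tab-separated fields per row.
# Field 1 is the gene ID; fields 2, 6 and 10 are each sample's second column.
paste sample1 sample2 sample3 |
  awk 'BEGIN {FS = OFS = "\t"} {print $1, $2, $6, $10}' > count_matrix

cat count_matrix
```

Running `paste sample1 sample2 sample3 | head -n 1` on its own is a quick way to inspect the pasted layout before choosing field numbers.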

14 months ago ahaswer • 150

Since you have a lot of samples (I guess you keep them in a separate directory), it would be much more convenient to avoid listing the desired columns by hand. You can open a terminal in the samples directory and run the following (assuming it contains only sample files):

paste * | awk 'BEGIN {OFS="\t"; FS="\t"}; {j=$1; for (i=2;i<=NF;i+=2) {j=j FS $i} print j}' > count_matrix

It is meant to work like the code above, without having to select tons of columns ;)

14 months ago ahaswer • 150

This does use all the samples, thank you. How does the loop over i work? When I ran this code, the count_matrix included two repeated columns from each sample rather than one.

Thanks

14 months ago lamia_203 • 30

Maybe you have duplicate files in the directory, or ran the code twice? Also check that the columns in the sample files are separated by a single delimiter, with no repeated delimiters. You can also use the paste command selectively if, for instance, all sample files have the same extension:

paste *.txt | awk 'BEGIN {OFS="\t"; FS="\t"}; {j=$1; for (i=2;i<=NF;i+=2) {j=j FS $i} print j}' > count_matrix

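The loop: "i" starts at the second field (i=2), the loop runs while "i" is less than or equal to NF (the number of fields, i.e. columns), and "i" increases by 2 after each iteration. On each iteration, field "i" is appended to "j", preceded by the field separator. After the loop, "j" is printed as one row.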
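One likely explanation for the two columns per sample: if each file has four columns, as the question describes, the pasted rows have four fields per sample, so a step of 2 picks fields 2 and 4 of every sample. The sketch below assumes that four-column layout, the same gene order in every file, and hypothetical file names ending in .txt. It steps by 4 and also writes a header row built from the file names, so each column carries its sample name.

```shell
# Scratch directory with two hypothetical four-column sample files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'geneA\t1\t2\t3\ngeneB\t4\t5\t6\n' > s1.txt
printf 'geneA\t10\t20\t30\ngeneB\t40\t50\t60\n' > s2.txt

# Header row: "GeneID", then each file name with the .txt suffix removed.
# The glob expands in the same sorted order as in the paste call below.
{
  printf 'GeneID'
  for f in *.txt; do
    printf '\t%s' "${f%.txt}"
  done
  printf '\n'
} > count_matrix.tsv

# Body: keep field 1 (gene ID), then every 4th field starting at field 2,
# which is the second column of each four-column sample file.
paste *.txt |
  awk 'BEGIN {FS = OFS = "\t"}
       {row = $1; for (i = 2; i <= NF; i += 4) row = row OFS $i; print row}' \
  >> count_matrix.tsv

cat count_matrix.tsv
```

The output name deliberately does not match the `*.txt` glob. Writing the result into the same directory under a name the glob matches means a second run would paste the previous output in as if it were a sample. With about 3200 files, paste may also run into the per-process open-file limit (see `ulimit -n`), in which case pasting in batches could be necessary.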

14 months ago ahaswer • 150

One suggestion using R would be:

# list only the file names that match a given pattern
dir(pattern="ReadsPerGene.out.tab")

# save the file names to a variable
files <- dir(pattern="ReadsPerGene.out.tab")

# read each file and append its second column to the matrix
counts <- c()
for( i in seq_along(files) ){
  x <- read.table(file=files[i], sep="\t", header=FALSE, as.is=TRUE)
  counts <- cbind(counts, x[,2])
}

# set the row names from the first column of the last file read
# (assumes every file lists the same genes in the same order)
rownames(counts) <- x[,1]
# set the column names based on input file names, with the suffix removed
colnames(counts) <- sub("_ReadsPerGene.out.tab", "", files, fixed=TRUE)

This example assumes each sample's results are in a file whose name contains ReadsPerGene.out.tab, as you might find using the STAR aligner.

14 months ago seidel • 6.8k
