Merging csv files with specific headers
18 months ago
windsur ▴ 20

Hello all,

I have a directory of tab-delimited CSV files. Each file has four sections, each with its own header line:

miRDeep2 score  novel miRNAs reported by miRDeep2   novel miRNAs, estimated false positives novel miRNAs, estimated true positives  known miRNAs in species known miRNAs in data    known miRNAs detected by miRDeep2   estimated signal-to-noise   excision gearing

novel miRNAs predicted by miRDeep2                                                              
provisional id  miRDeep2 score  estimated probability that the miRNA candidate is a true positive   rfam alert  total read count    mature read count   loop read count star read count significant randfold p-value    miRBase miRNA   example miRBase miRNA with the same seed    UCSC browser    NCBI blastn consensus mature sequence   consensus star sequence consensus precursor sequence    precursor coordinate

mature miRBase miRNAs detected by miRDeep2                                                              
tag id  miRDeep2 score  estimated probability that the miRNA is a true positive rfam alert  total read count    mature read count   loop read count star read count significant randfold p-value    mature miRBase miRNA    example miRBase miRNA with the same seed    UCSC browser    NCBI blastn consensus mature sequence   consensus star sequence consensus precursor sequence    precursor coordinate

#miRBase miRNAs not detected by miRDeep2                                        
miRBase precursor id    total read count    mature read count(s)    star read count remaining reads UCSC browser    NCBI blastn miRBase mature sequence(s)  miRBase star sequence(s)    miRBase precursor sequence  

Each section has its own data rows. I am interested in two of the sections: "novel miRNAs predicted by miRDeep2" and "mature miRBase miRNAs detected by miRDeep2".

First, I would like to generate a file of frequencies across files for the mature miRBase miRNA column of the "mature miRBase miRNAs detected by miRDeep2" section:

mature miRBase miRNA        Freq    Filenames
hsa-miR-486-5p  154 file1,file2,file3...
hsa-miR-93-5p   135 file1,file4,file5...
hsa-let-7i-5p   210 file2,file4,file5...

Note: if the same mature miRBase miRNA is repeated within a file, it should count only once.

Second, I would like the same for the "novel miRNAs predicted by miRDeep2" section, using its provisional id column.

An example of the data:

File1:

miRDeep2 score  novel miRNAs reported by miRDeep2   novel miRNAs, estimated false positives novel miRNAs, estimated true positives  known miRNAs in species known miRNAs in data    known miRNAs detected by miRDeep2   estimated signal-to-noise   excision gearing
10  1   0 +/- 1 1 +/- 0 (66 +/- 48%)    2656    683 85 (12%)    20.8    1
9   1   0 +/- 1 1 +/- 0 (66 +/- 48%)    2656    683 87 (13%)    21  1
8   2   0 +/- 1 2 +/- 1 (79 +/- 31%)    2656    683 87 (13%)    21.2    1



novel miRNAs predicted by miRDeep2                                                              
provisional id  miRDeep2 score  estimated probability that the miRNA candidate is a true positive   rfam alert  total read count    mature read count   loop read count star read count significant randfold p-value    miRBase miRNA   example miRBase miRNA with the same seed    UCSC browser    NCBI blastn consensus mature sequence   consensus star sequence consensus precursor sequence    precursor coordinate
22_4891 40.2    66 +/- 48%  -   75  72  0   3   yes -   -   -   -   cacugcaaccucugccuccggu  ugagaggcagagguugcagugg  ugagaggcagagguugcaguggcacgaucucaggucacugcaaccucugccuccggu   22:20012651..20012708:-
2_775   8.6 79 +/- 31%  -   14  10  0   4   yes -   -   -   -   gaagacagucgaacuugacu    uuuagugaggcccucggaucagc uuuagugaggcccucggaucagcccgcugggucagcccacugcccuggcggaacgcugagaagacagucgaacuugacu 2:133011980..133012059:-
22_4800 4.8 78 +/- 26%  -   7   4   0   3   yes -   -   -   -   cccuccucuccuguggccacaga ugguccaacgacaggaguagg   ugguccaacgacaggaguaggcuuguauuuaaaagcggccccuccucuccuguggccacaga  22:20052702..20052764:+



mature miRBase miRNAs detected by miRDeep2                                                              
tag id  miRDeep2 score  estimated probability that the miRNA is a true positive rfam alert  total read count    mature read count   loop read count star read count significant randfold p-value    mature miRBase miRNA    example miRBase miRNA with the same seed    UCSC browser    NCBI blastn consensus mature sequence   consensus star sequence consensus precursor sequence    precursor coordinate
8_2073  4167777.2   66 +/- 48%  -   8174911 8174878 0   33  yes hsa-miR-486-5p  -   -   -   uccuguacugagcugccccgag  cggggcagcucaguacaggau   uccuguacugagcugccccgaggcccuucaugcugcccagcucggggcagcucaguacaggau 8:41517960..41518023:-
8_1990  4167744.1   66 +/- 48%  -   8174846 8174816 0   30  yes hsa-miR-486-5p  -   -   -   uccuguacugagcugccccgag  cggggcagcucaguacaggau   uccuguacugagcugccccgagcugggcagcaugaagggccucggggcagcucaguacaggau 8:41517961..41518024:+
12_2896 11840.9 66 +/- 48%  -   23222   23209   0   13  yes hsa-let-7i-5p   -   -   -   ugagguaguaguuugugcuguu  cugcgcaagcuacugccuug    ugagguaguaguuugugcuguuggucggguugugacauugcccgcuguggagauaacugcgcaagcuacugccuug    12:62997470..62997546:+

#miRBase miRNAs not detected by miRDeep2                                                    
miRBase precursor id    total read count    mature read count(s)    star read count remaining reads UCSC browser    NCBI blastn miRBase mature sequence(s)  miRBase star sequence(s)    miRBase precursor sequence              
hsa-mir-451a    3801    3800    0   1   -   -   aaaccguuaccauuacugaguu  -   cuugggaauggcaaggaaaccguuaccauuacugaguuuaguaaugguaaugguucucuugcuauacccaga                
hsa-mir-941-5   914 914 0   0   -   -   cacccggcugugugcacaugugc -   ugugcacaugugcccagggcccgggacagcgccacggaagaggacgcacccggcugugugcacaugugccca                
hsa-mir-574 239 239 0   0   -   -   cacgcucaugcacacacccaca  ugagugugugugugugagugugu -   gggaccugcgugggugcgggcgugugagugugugugugugagugugugucgcuccggguccacgcucaugcacacacccacacgcccacacucagg            

File2:

miRDeep2 score  novel miRNAs reported by miRDeep2   novel miRNAs, estimated false positives novel miRNAs, estimated true positives  known miRNAs in species known miRNAs in data    known miRNAs detected by miRDeep2   estimated signal-to-noise   excision gearing
10  0   0 +/- 0 0 +/- 0 (0 +/- 0%)  2656    686 89 (13%)    25.5    1
9   0   0 +/- 0 0 +/- 0 (0 +/- 0%)  2656    686 89 (13%)    25.3    1
8   2   0 +/- 1 2 +/- 1 (78 +/- 32%)    2656    686 91 (13%)    25.8    1



novel miRNAs predicted by miRDeep2                                                  
provisional id  miRDeep2 score  estimated probability that the miRNA candidate is a true positive   rfam alert  total read count    mature read count   loop read count star read count significant randfold p-value    miRBase miRNA   example miRBase miRNA with the same seed    UCSC browser    NCBI blastn consensus mature sequence
15_3320 8.9 78 +/- 32%  -   22  21  0   1   yes -   -   -   -   agguagauagaacaggucuugu
18_4183 8.8 78 +/- 32%  -   16  15  0   1   yes -   -   -   -   cuucgaaagcggcuucggcu
9_2330  4.5 77 +/- 26%  -   13  12  0   1   no  -   -   -   -   ccagcccuguuccccaccccgc



mature miRBase miRNAs detected by miRDeep2                                                  
tag id  miRDeep2 score  estimated probability that the miRNA is a true positive rfam alert  total read count    mature read count   loop read count star read count significant randfold p-value    mature miRBase miRNA    example miRBase miRNA with the same seed    UCSC browser    NCBI blastn consensus mature sequence
8_1970  3713910.6   0 +/- 0%    -   7284671 7284476 0   195 yes hsa-miR-486-5p  -   -   -   uccuguacugagcugccccgag
8_2047  3712258.7   0 +/- 0%    -   7281431 7281218 2   211 yes hsa-miR-486-5p  -   -   -   uccuguacugagcugccccgag
13_3063 196960.2    0 +/- 0%    -   386326  386325  0   1   yes hsa-miR-92a-1-5p    -   -   -   uauugcacuugucccggccugu                                      

#miRBase miRNAs not detected by miRDeep2                                        
miRBase precursor id    total read count    mature read count(s)    star read count remaining reads UCSC browser    NCBI blastn miRBase mature sequence(s)  miRBase star sequence(s)    miRBase precursor sequence
hsa-mir-451a    4971    4971    0   0   -   -   aaaccguuaccauuacugaguu  -   cuugggaauggcaaggaaaccguuaccauuacugaguuuaguaaugguaaugguucucuugcuauacccaga
hsa-mir-941-5   464 464 0   0   -   -   cacccggcugugugcacaugugc -   ugugcacaugugcccagggcccgggacagcgccacggaagaggacgcacccggcugugugcacaugugccca
hsa-mir-1260b   70  70  0   0   -   -   aucccaccacugccaccau -   ucuccguuuaucccaccacugccaccauuauugcuacuguucagcaggugcugcugguggugauggugauagucuggugggggcggugg

Any help is more than welcome! Thanks!


I'd use a Perl one-liner to extract the section of interest (starting at the constant header line and ending at the next blank line). In R, I'd pass that one-liner as the cmd argument of data.table::fread for each file, combine the per-file results with data.table::rbindlist, and then build a matrix of counts with data.table::dcast.
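
For readers who know neither Perl nor data.table, here is a minimal sketch of the same extraction idea in Python, which is a swapped-in language and not what the reply above proposes. It assumes the section title sits alone on its line (apart from trailing tabs), is followed directly by the column header, and that the section ends at the next blank line. The function name extract_section and the trimmed three-column SAMPLE are hypothetical and only there for illustration.

```python
import csv
import io

# Trimmed, hypothetical stand-in for one miRDeep2 result file: only three
# columns are kept per section so the example stays readable.
SAMPLE = "\n".join([
    "novel miRNAs predicted by miRDeep2\t\t",
    "provisional id\tmiRDeep2 score\tmiRBase miRNA",
    "22_4891\t40.2\t-",
    "",
    "mature miRBase miRNAs detected by miRDeep2\t\t",
    "tag id\tmiRDeep2 score\tmature miRBase miRNA",
    "8_2073\t4167777.2\thsa-miR-486-5p",
    "8_1990\t4167744.1\thsa-miR-486-5p",
    "12_2896\t11840.9\thsa-let-7i-5p",
    "",
    "#miRBase miRNAs not detected by miRDeep2",
    "miRBase precursor id\ttotal read count",
    "hsa-mir-451a\t3801",
])


def extract_section(text, title):
    """Return the data rows (as dicts) of the section introduced by `title`."""
    lines = text.splitlines()
    # Locate the title line; trailing tabs or spaces after the title are ignored.
    for start, line in enumerate(lines):
        if line.strip() == title:
            break
    else:
        return []  # section not present in this file
    # Collect the column header and data rows up to the next blank line.
    body = []
    for line in lines[start + 1:]:
        if not line.strip():
            break
        body.append(line)
    reader = csv.DictReader(
        io.StringIO("\n".join(body)), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


if __name__ == "__main__":
    rows = extract_section(SAMPLE, "mature miRBase miRNAs detected by miRDeep2")
    print([row["mature miRBase miRNA"] for row in rows])
```

For a real file, the text would come from reading the file's contents, and the same call with the title "novel miRNAs predicted by miRDeep2" and the column "provisional id" covers the second section.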


Thanks, but I'm not familiar with Perl. Could you help me with the code? Thanks.
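
As one possible starting point without Perl, the counting step requested in the question (each id counted once per file, with the list of files it appears in) could be sketched in Python as below. This assumes the ids of the column of interest have already been collected per file. The names count_files_per_id and to_tsv are hypothetical, and the ids_by_file dictionary only mirrors the File1 and File2 examples from the question.

```python
from collections import defaultdict


def count_files_per_id(ids_by_file):
    """Map {filename: ids} to rows of (id, number of files, sorted filenames)."""
    files_by_id = defaultdict(set)
    for filename, ids in ids_by_file.items():
        for identifier in ids:
            # A set keeps each filename once, so an id repeated
            # within one file is still counted only once.
            files_by_id[identifier].add(filename)
    rows = [
        (identifier, len(files), sorted(files))
        for identifier, files in files_by_id.items()
    ]
    # Most widely shared ids first, ties broken alphabetically.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


def to_tsv(rows, id_column):
    """Format the rows as tab-delimited text with the header from the question."""
    lines = ["\t".join([id_column, "Freq", "Filenames"])]
    for identifier, freq, files in rows:
        lines.append("\t".join([identifier, str(freq), ",".join(files)]))
    return "\n".join(lines)


if __name__ == "__main__":
    # Values of the "mature miRBase miRNA" column from the two example files.
    ids_by_file = {
        "file1": ["hsa-miR-486-5p", "hsa-miR-486-5p", "hsa-let-7i-5p"],
        "file2": ["hsa-miR-486-5p", "hsa-miR-486-5p", "hsa-miR-92a-1-5p"],
    }
    print(to_tsv(count_files_per_id(ids_by_file), "mature miRBase miRNA"))
```

The same two functions apply unchanged to the provisional id column of the novel miRNAs section; only the input dictionary and the id_column label differ.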
