out of memory in R after run ChIPSeeker
2
0
Entering edit mode
7.3 years ago
Lila M ★ 1.2k

Hi everybody, I've tried to make some TSS profiles as follow:

Normalized Bam files using bamCoverage (output NGS974Norm bedgraph)

running ChiPseeker in bioconductor:

reads <- import.bed(con="NGS974Norm") print(reads) cov <- coverage(reads) peakAnno <- annotatePeak(reads, tssRegion=c(-500, 2000), TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene, annoDb="org.Hs.eg.db")

But when it finish reports the following error: Error in rsqlite_send_query(conn@ptr, statement) : out of memory

any ideas?

ChIP-Seq R ChIPseeker • 3.3k views
ADD COMMENT
1
Entering edit mode

How big is your dataset ? How much RAM do you have on your computer ? Have you tried running annotatePeak on only a few genes and not the full annotation (TxDb.Hsapiens.UCSC.hg19.knownGene) ?

ADD REPLY
0
Entering edit mode

I think this is not the problem because I have Memory 31.1GiB Processor Cor i7-6700 CPU @ 3.40GHz × 8

I need to run it woth all the TSS of all ensembl protein coding transcripts. any Idea?

ADD REPLY
0
Entering edit mode

When exactly do you get the error ?

After importing the bed file or when running annotatePeak ? If it is the later, at which step does it crash ? Loading peak file ? calculating distances ?

ADD REPLY
0
Entering edit mode

Is just after "adding gene annotation... " could it be the file?

ADD REPLY
0
Entering edit mode

That's weird...You can always try to test the code on smaller datasets (only 1 chrom of the bedgraph file or only 1 gene of the annotation for instance), but honestly, I don't know what is going on.

ADD REPLY
1
Entering edit mode

I've solved the problem, ChIPseeker need a file with 6 columns, not with 4 as bedgraph files report.

ADD REPLY
0
Entering edit mode

can you send me a sample file (e.g. first 20 rows of your bedgraph file) for testing?

ADD REPLY
0
Entering edit mode

Hi, yes it this:

"chr1" 0 10000 0 1 "." "chr1" 10000 10050 55 2 "." "chr1" 10050 10100 91 3 "." "chr1" 10100 10150 70 4 "." "chr1" 10150 10200 71 5 "." "chr1" 10200 10250 68 6 "." "chr1" 10250 10300 51 7 "."" "chr1" 10450 10500 14 11 "."

ADD REPLY
0
Entering edit mode

It would be better to send an attached file instead of pasting file content.

I don't think it's the number of column issue. I tested and it actually works fine with only 4 columns.

> f = getSampleFiles()[[4]]
> p = readPeakFile(f)
> p$V4 = NULL
> p$V5 = NULL
> p
GRanges object with 1331 ranges and 0 metadata columns:
         seqnames                 ranges strand
            <Rle>              <IRanges>  <Rle>
     [1]     chr1     [ 815093,  817883]      *
     [2]     chr1     [1243288, 1244338]      *
     [3]     chr1     [2979977, 2981228]      *
     [4]     chr1     [3566182, 3567876]      *
     [5]     chr1     [3816546, 3818111]      *
     ...      ...                    ...    ...
  [1327]     chrX [135244783, 135245821]      *
  [1328]     chrX [139171964, 139173506]      *
  [1329]     chrX [139583954, 139586126]      *
  [1330]     chrX [139592002, 139593238]      *
  [1331]     chrY [ 13845134,  13845777]      *
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

> a = annotatePeak(p, annoDb='org.Hs.eg.db')
>> preparing features information...         2016-12-21 19:12:27
>> identifying nearest features...       2016-12-21 19:12:27
>> calculating distance from peak to TSS...  2016-12-21 19:12:27
>> assigning genomic annotation...       2016-12-21 19:12:27
>> adding gene annotation...             2016-12-21 19:12:28
'select()' returned many:many mapping between keys and columns
>> assigning chromosome lengths          2016-12-21 19:12:28
>> done...                   2016-12-21 19:12:28
Warning message:
In loadTxDb(TxDb) :
  >> TxDb is not specified, use 'TxDb.Hsapiens.UCSC.hg19.knownGene' by default...
> a
Annotated peaks generated by ChIPseeker
1331/1331  peaks were annotated
Genomic Annotation Summary:
              Feature  Frequency
9    Promoter (<=1kb) 48.1592787
10   Promoter (1-2kb)  6.0856499
11   Promoter (2-3kb)  4.5078888
4              5' UTR  0.3005259
3              3' UTR  2.1036814
1            1st Exon  0.6010518
7          Other Exon  2.4042074
2          1st Intron  2.1036814
8        Other Intron  5.8602554
6  Downstream (<=3kb)  0.9767092
5   Distal Intergenic 26.8970699

ADD REPLY
0
Entering edit mode

A thought that comes to my mind: Just because your computer has 32 GB RAM does not mean that 32GB Ram are also available to R. So it might be useful to just go back to the original error msg and try to investigate. Also the error msg talk about rsqllite, so maybe this module ran out of memory? As mentioned before, I'd try it with less data see if that helps, also monitor the memory usage when you start the process. Sorry can not be more specific than that. .... but 'all protein coding transcripts in ensembl' ... that sounds like something that could push any machine to its limits.

ADD REPLY
0
Entering edit mode

(posted twice sorry)

ADD REPLY
0
Entering edit mode
7.3 years ago
Guangchuang Yu ★ 2.6k

@Lila M had sent me the original file which is about 1.2Gb. It's indeed memory intensive in ID converting step. I had optimized this step and it runs more faster and required less memory.

The optimized version is available on github: https://github.com/GuangchuangYu/ChIPseeker.

It only took 19 second to annotate the peaks (31027455 records, 1.2Gb):

> p = readPeakFile("~/Downloads/peak_NGS974")
> a=annotatePeak(p, annoDb='org.Hs.eg.db')
>> preparing features information...         2016-12-21 20:53:31
>> identifying nearest features...         2016-12-21 20:53:31
>> calculating distance from peak to TSS...     2016-12-21 21:08:42
>> assigning genomic annotation...         2016-12-21 21:08:42
>> adding gene annotation...             2016-12-21 21:11:52
>> assigning chromosome lengths             2016-12-21 21:12:35
>> done...                     2016-12-21 21:12:35

Warning message:
In loadTxDb(TxDb) :
  >> TxDb is not specified, use 'TxDb.Hsapiens.UCSC.hg19.knownGene' by default...
>
> a
Annotated peaks generated by ChIPseeker
31027137/31027455  peaks were annotated
Genomic Annotation Summary:
              Feature  Frequency
9    Promoter (<=1kb)  2.7942862
10   Promoter (1-2kb)  2.3390009
11   Promoter (2-3kb)  2.1797564
4              5' UTR  0.1244330
3              3' UTR  1.2549530
1            1st Exon  0.1156955
7          Other Exon  1.4102784
2          1st Intron  8.1670636
8        Other Intron 33.7840291
6  Downstream (<=3kb)  0.8216872
5   Distal Intergenic 47.0088168
ADD COMMENT
0
Entering edit mode

I had optimized this step and it runs more faster and required less memory.

Nice, that was what I thought. If we talk about something like 'all protein coding transcripts in ensembl' it is important that your code is clean. If not, I am not surprised that you run into problems.

ADD REPLY
0
Entering edit mode

Thank you very much! Now the program run super faster!

ADD REPLY
0
Entering edit mode
7.3 years ago
Lila M ★ 1.2k

wrong post !!!

ADD COMMENT

Login before adding your answer.

Traffic: 1974 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6