Normalization for small non-coding RNA (piRNAs)
3
1
Entering edit mode
5.4 years ago

Dear Biostars,

Currently, I'm working in piRNA expression in different cell lines and I would like to ask you about the way I can proceed with data transformation and normalization. (follow-up question regarding this Q) Now, the main issue is that in order to enrich for piRNAs we performed periodate treatment: "The PO treatment has been shown to be effective in separating piRNAs from other classes of small RNAs and degradation products of longer mRNA transcripts studies" We have treated libraries with ~10 million reads and untreated with ~45 million reads. In order to find piRNAs in our samples, we used SPORTS1.0 with output: matched reads to the genome and matched reads to small-RNA databases, unmatched reads to the genome and matched reads to small-RNA databases. For every database regarding different small RNA (rRNA, tRNA, piRNA, lncRNA ....) we get a file with the particular reads matched to that database.piRNA file example:

t00000406   617     +   piR-hsa-3546    3   CTGTTAACCGAAAGGTTGGTGGT     IIIIIIIIIIIIIIIIIIIIIII 1
t00000517   445     +   piR-hsa-3454    2   CACGTGTTAGGACCCGAAAGA   IIIIIIIIIIIIIIIIIIIII 0
t00000519   439     +   piR-hsa-3546    0   CGGCTGTTAACCGAAAGGTTGGTGGT IIIIIIIIIIIIIIIIIIIIIIIIII 0
t00000803   402 +   piR-hsa-3454    2   CACGTGTTAGGACCCGAAA IIIIIIIIIIIIIIIIIIIII   0
t00000817   394 +   piR-hsa-3454    3   ACGTGTTAGGACCCGAAAGA    IIIIIIIIIIIIIIIIIIII    0
t00001363   255 +   piR-hsa-3546    4   TGTTAACCGAAAGGTTGGTGGT  IIIIIIIIIIIIIIIIIIIIII  2
t00001363   255 +   piR-hsa-29932   0   TGTTAACCGAAAGGTTGGTGGT  IIIIIIIIIIIIIIIIIIIIII  2

The second column is the number of reads.

The majority of reads multimap to different piRNAs, so I took the sum of reads assigned to each piRNA (both unmatched/matched reads to the genome). for the example above my output file is:

piRNA         counts
  <chr>          <dbl>
1 piR-hsa-29932    255
2 piR-hsa-3454    1241
3 piR-hsa-3546    1311

If the above example is for a treated sample, the untreated sample is something like that:

    piRNA         counts
      <chr>          <dbl>
    1 piR-hsa-29932   3071
    2 piR-hsa-3454    12704
    3 piR-hsa-3546    12486

In order to adjust each sample for library differences, I performed this kind of data transformation: If we have 5 treated and 5 untreated biological replicates :

treated_1   6774893
treated_2   4973372
treated_3   7667539
treated_4   41842208
treated_5   18115268
untreated_1   17544293
untreated_2 7106260
untreated_3 5542361
untreated_4 5091629
untreated_5 41335714

We pick the "largest" library: treated_4 41842208 and divide reads by each library in order to get an upscaling factor:

tr %>% mutate(factors=max(tr$reads)/reads)
# A tibble: 10 x 3
   sample         reads factors
   <chr>          <dbl>   <dbl>
 1 treated_1    6774893    6.18
 2 treated_2    4973372    8.41
 3 treated_3    7667539    5.46
 4 treated_4   41842208    1   
 5 treated_5   18115268    2.31
 6 untreated_1 17544293    2.38
 7 untreated_2  7106260    5.89
 8 untreated_3  5542361    7.55
 9 untreated_4  5091629    8.22
10 untreated_5 41335714    1.01

Then we multiply every "feature" of each sample with the corresponding factor. If the 1st example is treated_1 then:

     mutate(pirna_counts,counts*tr$factors[1])
# A tibble: 3 x 3
     piRNA         counts upscaledcounts
  <chr>          <dbl>          <dbl>
1 piR-hsa-29932    255          1575.
2 piR-hsa-3454    1241          7665.
3 piR-hsa-3546    1311          8097.

What kind of normalization I could perform between libraries, what kind of batch effect correction should I use and is there any kind of bias that I'm introducing using this upscaling transformation?

normalization DEseq2 edgeR limma • 2.1k views
ADD COMMENT
0
Entering edit mode

What is the biological question, or comparison, you are interested in? Are there other conditions besides control vs treatment?

ADD REPLY
0
Entering edit mode

Well, the first question is "Are piRNAs expressed in these cell lines? If yes, could we check the relative expression in treated/untreated?". There are also other conditions but I cannot report them now.

ADD REPLY
4
Entering edit mode
5.4 years ago
A. Domingues ★ 2.7k

(this is a bit too long for a comment so adding it as an answer)

Judging also from your post in BioC it seems like you are interested in a sort non-standard analysis, or at least one that I am not familiar with. First things first.

Are piRNAs expressed in these cell lines?

A better way to answer this is question is to do a RIP of one or more of the core components of the piRNA pathways in humans ( Hiwi, Hili and Hiwi2?), followed by small RNA-seq and see if the profile of the small RNAs matches what is know:

  • RNA length profile
  • 1U and 10A bias
  • ping-pong signal

I would use all mapped reads, and not a subset of those reads that match a database. To be fair you can already do this with the small RNA you have, but showing that there are indeed bound to a protein of the pathway is much more convincing in a new system.

If yes, could we check the relative expression in treated/untreated?

For this I would do the same piRNA characterization as about, normalizing to mapped reads, and see if the signal is better in the treated libraries. Not particularly quantitative but if the treatment really is effective in enriching for piRNAs, it will pass the eyeball test.

If you want to go down the route of using small RNA databases and DESeq2, then count each read only once (otherwise you are creating data that doesn't exist) and use the read counts for a fairly standard DESeq analysis. Two things to keep in mind:

  1. assuming the treatment is really effective the library composition will be biased and using only the table of piRNAs counts might (or) will violate a few assumptions of DESeq2, specially that the majority of genes will not change with treatment. Using all small RNAs might be able to alleviate this global effect.
  2. This analysis will tell you which piRNAs change with treatment, but I don't know if you can say anything about enrichment of piRNAs in general, but maybe someone with a better knowledge of statistics can chime in on this.

So, I would keep it simple and normalize to mapped reads for general small RNA composition / properties analysis, though some people normalize to miRNAs or other class of small non-coding RNAs, and above all keep it simple and use tried and trusted analysis / tools (as suggested in a previous answer). If the results are promising and hold to some basic predictions and assumptions about piRNAs, explore a bit more.

edit:

Maybe I misunderstood, but does this mean that a read is counted more than once? Each read should be counted only once. If you want to use multi-mapping reads the best options are to randomly select a piRNA or weight it - read matches two piRNAs, each gets 0.5 counts.

No, you didn't misunderstand. I've also posted an informative example in Biostars . I don't want to choose randomly a piRNA because it may be misleading. Using weights is a possibility but what about a read that matches 5 piRNAs such as these: piR-51199 TGCCAAACTAAGCAAGGTCACGTGTGA piR-51200 TGCCAAACTAAGCAAGGTCACGTGTGAA piR-51201 TGCCAAACTAAGCAAGGTCACGTGTGAAG piR-51202 TGCCAAACTAAGCAAGGTCACGTGTGAAGA piR-51203 TGCCAAACTAAGCAAGGTCACGTGTGAAGG It's one will get 0.2 counts. Is it the correct way to "counter" multi-mapping?

First of all I would argue that you your read corresponds to only one of those piRNAs. It has a defined sequence and length, so it can only be an 100% reciprocal match with one of those 5 sequences:

read TGCCAAACTAAGCAAGGTCACGTGTGAA

piR-51199 TGCCAAACTAAGCAAGGTCACGTGTGA

piR-51200 TGCCAAACTAAGCAAGGTCACGTGTGAA

piR-51201 TGCCAAACTAAGCAAGGTCACGTGTGAAG

piR-51202 TGCCAAACTAAGCAAGGTCACGTGTGAAGA

piR-51203 TGCCAAACTAAGCAAGGTCACGTGTGAAGG

Yes it has a partial match with the others, but not a full match.

Now if we are talking about multimapping in terms of genomic location, in my opinion there is no "right" way* to solve the issue, since they all involve a compromise between sensitivity and specificity. For most of the things I do I don't care where a piRNA maps, as long as it maps to a location in the genome. So I pick __one location__ randomly. For some specific purposes, I need to know the exact mapping location and if can't get it, though luck, throw away all multimappers. I never used weights because I see little advantage in it, but it is an acceptable compromise.

*However there is a bad choice: using all mapped locations for a read. In the case you mention, one read mapping to multiple piRNAs your data will blow up from one RNA that was actually sequence to 5. If it maps to 100 different locations you are suddenly analysing hundreds of reads or RNAs that were not cloned at all.

I hope this helps a little.

ADD COMMENT
0
Entering edit mode

Dear António Miguel de Jesus Domingues, thank you for the detailed answer! I will focus on the second part of the answer as it seems that I do not have a correct mapping of reads(regarding small RNAdbs) and it's a core problem.

read TGCCAAACTAAGCAAGGTCACGTGTGAA has a full match with piR-51200 and partial match with piR-51201-3. read TGCCAAACTAAGCAAGGTCA has a partial match (is it a correct example?) with all of them.

In my analysis, I align reads with length>15. This leads to having many reads with partial matching. Many small-RNA map to a different location in the genome thus many reads would multi-map and produce this kind of issue. At first, I thought that there is a difference between a multi-mapping read and a multi-mapped small-RNA(piRNA) but when we have that kind of problem(piR-51199-piR-51203) then it seems a very complicated issue.

piR-51199-piR-51202 map to the same location in the genome. All reads (with length less than piR-51199) that map to that particular location are multi-mapped between these piR-51199-piR-51202. How do we "solve" this issue?

If I understand correctly I have to use only reads with a full match and so exclude all multi-mapped/partially matched? -> Map reads to genome. For multi-mapped reads, do I keep one location only or exclude them?

Best regards K

ADD REPLY
1
Entering edit mode

I align reads with length>15.

This is very short. I would start at about 18 bp and even then I can't of a small RNA that small (though they might exist).

I thought that there is a difference between a multi-mapping read and a multi-mapped small-RNA(piRNA) but when we have that kind of problem(piR-51199-piR-51203) then it seems a very complicated issue

These are two different issues and I would argue that only the multimapping read is really a problem. A "multi-mapped small-RNA(piRNA)" is totally dependent on whatever was defined as a piRNA on a database. Like I said before I don' use those DB and afaik neither do most high-profile labs in the piRNA field.

piR-51199-piR-51202 map to the same location in the genome. All reads (with length less than piR-51199) that map to that particular location are multi-mapped between these piR-51199-piR-51202. How do we "solve" this issue?

This is a very good question which leads to the definition of a piRNA. As I mentioned in an earlier answer to on one of your posts, when dealing with piRNAs, at least for some species (_C. elegans_ does it's own thing), forget about classical gene definitions. Ignore them really. piRNAs are not (usually) defined as transcriptional units and you can have samples with the same genetic background and yet very different piRNA populations. piRNAs can also be generated de novo and passed on to the next generation (in the germline). So this is complicated (and fascinating) but not for the reasons you outline.

__So is it any cloned sequence binding to a piwi protein a piRNA?__

In this case, piR-51199 and piR-51202 are two different piRNAs.

__Are all sequences that map to the same 5' position but have variable lengths?__

Then piR-51199 and piR-51202 are in fact the same piRNA and should be listed as such in the DB.

And does it really matter? My opinion is that depends on the analysis and the caveats / compromises one is willing to accept. For your analysis, and without knowing the ultimate goal, I am guessing it doesn't. What I usually, if I want to do some kind of differential analysis of piRNAs, is:

  • map the small RNA to the genome
  • count the number of piRNAs targeting each transposable element (using repEnrich), or simply count reads mapping to annotated genes
  • and see how a treatment / mutant / dev stage affected the piRNA targeting.
ADD REPLY
0
Entering edit mode

Thank you again for the detailed answer. Is it possible to extend a bit the part of

map the small RNA to the genome

For example:

  • Bowtie with 1 mismatch and no multi-mapping
  • apply featureCounts or summarizeOverlaps or tximport to count all small-RNAs (using specific libraries or some other file?)
  • use repEnrich
  • Normalization of the counts
  • DE analysis if possible

simply count reads mapping to annotated genes"

But how do we now that these kinds of reads derive from piRNAs? How am I going to annotate piRNAs without using coordinates from a db?

ADD REPLY
1
Entering edit mode
5.4 years ago
A. Domingues ★ 2.7k

edit: this should have been a comment. It's late.

But how do we now that these kinds of reads derive from piRNAs?

I would thread carefully in the case of piRNA derived from protein-coding genes. But you can still, and very naively count reads mapping to annotated genomic features such as rRNA, miRNA, repeats, etc. If a read maps to transposons, has a certain length (usually 25-30 depending on the species), starts with a U (T) or has an A position 10, it is quite likely to be a piRNA. You can always see if it originates from a cluster, but I am not sure how well those annotated in human. Also, there is extensive literature of piRNA biology in Drosophila but don't assume all mechanics will also work in the same way in other species because it probably won't, though the broad functionality is conserved.

Ultimately there is always a degree of uncertainty when studying piRNAs that you have to accept. I urge to read the literature, in particular the mouse model system which might be the closest to human, and see what are the common data analysis practices in the field. Also ask people in your lab about piRNA biology.

ADD COMMENT
0
Entering edit mode
5.4 years ago
A. Domingues ★ 2.7k

rRNA, miRNA, tRNA

I forgot to add that based simply on mapping to these annotated features, it is quite likely the reads are not originating from piRNAs.

ADD COMMENT

Login before adding your answer.

Traffic: 2052 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6