Retrieving Phylop Scores Using R
1
1
Entering edit mode
10.4 years ago

I am looking for a method of retrieving phyloP scores (a per base measurement of evolutionary conservation) using R, given a set of chromosomal coordinates. I have found databases hosted by the UCSC Genome Browser, but they have proved very unwieldy. Does anyone know of any packages or other tricks for quickly fetching phyloP scores given a range of chromosomal coordinates?

r conservation • 7.6k views
ADD COMMENT
11
Entering edit mode
10.4 years ago

Not with R, but you could use BEDOPS tools:

  • Convert the per-chromosome phyloP WIG files on UCSC's servers to sorted BED using wig2bed.

  • Prepare your genomic regions into a sorted BED file containing per-base elements.

  • Run bedmap on the phyloP data and your regions to map per-base phyloP scores to your per-base regions.

Retrieving phyloP data and converting it to sorted BED

We can make per-chromosome phyloP BED files and ultimately use these with bedmap or other BEDOPS tools. As an example, we can use the vertebrate hg19 phyloP data located at http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/

$ rsync -avz --progress rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate .

...

$ for fn in `ls vertebrate/*.gz`; \
    do gunzip -c $fn \
        | wig2bed - \
        > ${fn%%.*}.bed; \
    done

Preparing your regions-of-interest

Let's assume your regions-of-interest are already in a four-column (or more) BED file called roi.contiguous.unsorted.bed. To make sure it works with BEDOPS correctly, we sort it and create a file called roi.contiguous.bed:

$ sort-bed roi.contiguous.unsorted.bed > roi.contiguous.bed

Next, we want to split the ranges into per-base elements. We assume there is an ID field in the fourth column. We also assume that the ID for a given region-of-interest is uniquely named. When splitting our regions, we can use this ID as a unique prefix, in order to specifically and uniquely identify with which range the per-base element associates:

$ awk ' \
    { \
         regionChromosome = $1; \
         regionStart = $2; \
         regionStop = $3; \
         regionID = $4; \
         baseIdx = 0; \
         for (baseStart = regionStart; baseStart < regionStop; baseStart++) { \
             baseStop = baseStart + 1; \
             print regionChromosome"\t"baseStart"\t"baseStop"\t"regionID"-"baseIdx; \
             baseIdx++; \
         } \
    }' roi.contiguous.bed > roi.perBase.bed

Run head or less on the roi.perBase.bed file so that you see what this script does. When we map phyloP scores, we will be able to map them base-by-base.

Note: If your BED file does not have an ID field, or region IDs are not unique, it is trivial to use awk to add or replace the ID field with a unique, per-row identifier.

Performing bedmap operations

Finally, we're ready to do the map operation and see which elements associate:

$ for chrLabel in `seq 1 22` X Y; \ 
    do \
        mapFn=/path/to/phyloP/chr${chrLabel}.bed; \
        bedmap --chrom chr${chrLabel} --echo --echo-map-score roi.perBase.bed $mapFn \
            > perBase.chr${chrLabel}.answer.bed; \
    done

The file perBase.chr${chrLabel}.answer.bed (where chr${chrLabel} is replaced with the chromosome name from the per-chromosome phyloP files) contains per-base-split regions-of-interest along with the phyloP scores that associate with each base.

To explain the bedmap statement, the --echo operator prints out the region-of-interest (a per-base element from your original regions-of-interest) and --echo-map-score prints out the phyloP score that associates with that base, for each chromosome.

In the case where no phyloP score maps to a base, a NAN is printed.

You may or may not want to use sed or other tools to replace NAN with a zero, particularly if the presence of a zero is a biologically meaningful score, while the absence of a score is also biologically meaningful, but in a different way.

Some more work to do

A couple more things you could do that might make this easier, if you have to automate this process:

  • Concatenate the per-chromosome phyloP BED files into one master BED file.

  • Merge the split elements in the perBase.chr${chrLabel}.answer.bed result back into contiguous ranges, using the ID prefix to group them. In a column adjacent to the newly re-merged contiguous range, print all the scores into a row vector that is delimited with some useful character.

Doing step 1 would remove the need to use a for loop on the bedmap statement. You then just have one bedmap operation against the master phyloP file, instead of looping through each per-chromosome phyloP file.

Step 2 would let you more easily process contiguous regions. For example, you might be working with a window around a set of transcription start sites and you want to know what the mean conservation is across each window.

Hope this gets you started!

ADD COMMENT
1
Entering edit mode

Thanks for sharing details and also for phastcons in a separate post. One correction: I think there should be --chrom $chr option in bedmap --echo --echo-map-score roi.perBase.bed $fn... else it is echoing all chromosomes from roi.perBase.bed. That is strange given $fn only has one chromosome and mapping seems to be based on start-stop columns and ignoring chromosome column.

ADD REPLY
0
Entering edit mode

You're right that --chrom $chr is needed. The file roi.perBase.bed is the "reference file" and mapping will be done against that. So even if no overlaps are found in the map file $fn (or now chr${chrLabel}, to track chromosome names), elements will be reported from all chromosomes of the reference file (unless --skip-unmapped is used, but that is not useful here). Good catch.

ADD REPLY
0
Entering edit mode

Maybe you can give some suggestions on a good strategy of assigning the unique per row identifier. Not sure what is most appropriate to this method just given a simple bed file. My main interest is to get the scores from the merged elements which would represent a single score per interval within the bed. My bed file looks as below. Thanks!

    seqnames    start   end width   strand  baseMean    log2FoldChange  lfcSE   stat    pvalue  padj

chr1    31583078    31583428    351 *   227.0067476 -7.376694254    0.964565006 -7.64769011 2.05E-14    2.10E-08
ADD REPLY
1
Entering edit mode

In Python:

#!/usr/bin/env python

import sys

eidx = 0
for line in sys.stdin:
  elems = line.rstrip().split('\t')
  chr = elems[0]
  start = int(elems[1])
  stop = int(elems[2])
  bidx = 0
  for idx in range(start, stop):
    sys.stdout.write('%s\t%06d-%06d\n' % ('\t'.join([chr, str(idx), str(idx+1)]), eidx, bidx))
    bidx += 1
  eidx += 1

Then:

$ ./split.py < in.bed | sort-bed - > out.bed

The file out.bed will be sorted and contain per-base positions within each element.

The ID field is effectively unique by using the form A-B, where A is the element or row index, and B is the position or base index within the element. I use zero-padding to make the result easier to troubleshoot/read.

If all the elements in in.bed are of the same length, this makes mapping and later aggregation very fast, as you can split the ID field into element index and base position index. The base index can directly reference the contents within an array at a specific position.

ADD REPLY
0
Entering edit mode

This just may be my lack on Python understanding but in the initial python script where is the bed file input, that file requiring the unique identifiers? Thank you!

ADD REPLY
1
Entering edit mode

Maybe I misunderstand. I'm assuming you have a BED file that you want to map and aggregate on a per-base level with phyloP conservation scores (or whatever per-base signal, but phyloP for the purposes of this question). This BED file is made up of regions of interest, like TFBS or promoters, etc. What are you trying to do?

ADD REPLY
0
Entering edit mode

That is exactly right, at the end Id like to have a conservation score based on phyloP for each interval in my bed file to sort for downstream analysis. I was just asking about the python script you listed and how I can input this initial bed file into it so as to add unique identifiers to it before proceeding with your initial example in this post. Thanks!

ADD REPLY
1
Entering edit mode

If you need to strip the header, you could do:

$ ./split.py <(tail -n+2 in.bed) | sort-bed - > out.bed

Then you'd use out.bed for scoring.

ADD REPLY
0
Entering edit mode

Thats perfect thank you!

ADD REPLY
0
Entering edit mode

Quick question, tried doing mapping as you suggested with the following code:

for fn in `ls mm10.60way.phyloP60way/*.bed`; \ 
    do bedmap --echo --echo-map-score roi.perBase.bed $fn \
        > perBase.$fn.answer.bed; \
    done

And got this error, not sure why its adding answer.bed to the phyloP files in this instance? Thanks!

-bash: perBase.mm10.60way.phyloP60way/chr10.phyloP60way.wigFix.bed.answer.bed: No such file or directory
-bash: perBase.mm10.60way.phyloP60way/chr11.phyloP60way.wigFix.bed.answer.bed: No such file or directory
-bash: perBase.mm10.60way.phyloP60way/chr12.phyloP60way.wigFix.bed.answer.bed: No such file or directory
-bash: perBase.mm10.60way.phyloP60way/chr13.phyloP60way.wigFix.bed.answer.bed: No such file or directory
-bash: perBase.mm10.60way.phyloP60way/chr14.phyloP60way.wigFix.bed.answer.bed: No such file or directory
ADD REPLY
1
Entering edit mode

Can you print a snippet of the results from this?

for fn in `ls mm10.60way.phyloP60way/*.bed`; do \
    echo $fn; \
    echo ${fn##*/}; \
done

I suspect you may need to use the second variable (the "basename") in some form, instead of the raw $fn. I'd troubleshoot there.

ADD REPLY
0
Entering edit mode

ok its really strange but I redid the import of mm10 phyloP as follows:

rsync -avz --progress rsync://hgdownload.cse.ucsc.edu/goldenPath/mm10/phyloP60way/mm10.60way.phyloP60way .

...

This gave a directory mm10.60way.phyloP60way which contained all of these the necessary .gz files, following this I ran:

for fn in `ls mm10.60way.phyloP60way/*.gz`; \
    do gunzip -c $fn \
        | wig2bed - \
        > ${fn%%.*}.bed; \
    done

Which I thought would give me individual .bed files for use with bedmap however it concatenated all of them it seems into one file named mm10.bed where the only chromosome listed is chrY for some odd reason, can you see why this would be? I guess I can wig2bed all of them individually but its just odd.

ADD REPLY
1
Entering edit mode

Here's what I see:

$ for fn in `ls mm10.60way.phyloP60way/*.gz`; do echo ${fn}; echo ${fn%%.*}; done
mm10.60way.phyloP60way/chr10.phyloP60way.wigFix.gz
mm10
...
mm10.60way.phyloP60way/chrY.phyloP60way.wigFix.gz
mm10

I think you may want a slight tweak:

$ for fn in `ls mm10.60way.phyloP60way/*.gz`; do echo ${fn}; echo ${fn##*/}; done
mm10.60way.phyloP60way/chr10.phyloP60way.wigFix.gz
chr10.phyloP60way.wigFix.gz
...
mm10.60way.phyloP60way/chrY_JH584301_random.phyloP60way.wigFix.gz
chrY_JH584301_random.phyloP60way.wigFix.gz

In other words, use ${fn##*/} and not ${fn%%.*}. Maybe prefix ${fn##*/} with another directory.

Bash variables like these are a huge pain in the ass. So I'd suggest getting into the habit of using echo to debug and sanity-check what your source and destination paths look like, before running the for loop with real data.

ADD REPLY
0
Entering edit mode

This worked perfectly, thanks for tip in terms of checking the bash variables and adjusting them. I sometimes neglect to think of these path issues. In terms of concatenating the per-chromosome phyloP BED files would you suggest just using bedtools merge or something different? Also I was hoping you could give a suggestion on a good way to merge the perBase.$fn.answer.bed back into continuous ranges, since at the end I would like just a regular interval bed file where every interval has some conservation score that I can thereafter sort by. Thanks again!

ADD REPLY
1
Entering edit mode

Let's say you have a per-base BED file that looks like this, where score or signal data are in the fifth column:

chrZ    1234    1235    id-000001      0.4321
chrZ    1235    1236    id-000002      0.9385
chrZ    1236    1237    id-000003      1.3533
...

You can get contiguous ranges from this via:

$ bedops --merge perBase.bed > merged.bed

You could then use bedmap to get the mean score or signal over the merged range:

$ bedmap --echo --mean --delim '\t' merged.bed perBase.bed > meanPerBaseSignalOverMergedRegions.bed

The bedops tool does set operations. The bedmap tool does map operations. Here, we map signal to genomic intervals, and we do some specified operation on that signal.

If you want something other than a mean, there are other signal operations: https://bedops.readthedocs.io/en/latest/content/reference/statistics/bedmap.html#score-operations

You could get the median conservation score over the merged range, for instance.

You can also specify multiple signal operations:

$ bedmap --echo --mean --stdev --delim '\t' merged.bed perBase.bed > meanAndSdPerBaseSignalOverMergedRegions.bed

This gives you the mean and standard deviation of signal over merged regions.

Using the --delim '\t' operator puts these results into separate tab-delimited columns, for example, to make it easy to pipe to sort to get a sorted list:

$ bedmap --echo --mean --delim '\t' merged.bed perBase.bed | sort -nk4,4 > meanPerBaseSignalOverMergedRegions.sortedBySignal.txt

Note that I change the result's extension from bed to txt, to serve as a reminder that the output is no longer sort-bed-sorted, and so I would need to do sorting before use with set operation tools.

ADD REPLY
0
Entering edit mode

That works great thanks!

Still having a little trouble establishing the right variable for mapping my data with unique IDs to the phyloP per chromosome files, my data looks as follows:

chr1    11924388    11924389    000960-000000
chr1    11924389    11924390    000960-000001
chr1    11924390    11924391    000960-000002
chr1    11924391    11924392    000960-000003

I run the following commands, with similarly changed variables before using bedmap:

for fn in `ls mm10_phyloP_bed/*.bed`; \ 
    do bedmap --echo --echo-map-score roi.perBase.bed ${fn##*/}  \
        > perBase.${fn##*/}.answer.bed; \
    done

I get this output which appears to be missing the scores entirely, not sure why the --echo-map-score flag does not add scores and only the pipe:

chr1    11924389    11924390    000960-000001|
chr1    11924390    11924391    000960-000002|
chr1    11924391    11924392    000960-000003|

Any idea as to why? Thanks again!

ADD REPLY
0
Entering edit mode

It is likely because those positions have no scores associated with them.

If you do something like this:

$ echo -e 'chr1\t11924390\t11924391' | bedops -e 1 perBase.chr1.answer.bed -

and if you get no answer back, that means there is no element in perBase.chr1.answer.bed that has a score associated with it.

If it helps with processing results, you can add the --skip-unmapped operator to the bedmap command and it will leave out elements in the output where there are no mappings between reference and map elements, i.e.:

for fn in `ls mm10_phyloP_bed/*.bed`; \ 
    do bedmap --skip-unmapped --echo --echo-map-score roi.perBase.bed ${fn##*/}  \
        > perBase.${fn##*/}.answer.bed; \
    done
ADD REPLY
0
Entering edit mode

So two questions, for phyloP to have a score at a particular base on lets say chromosome 1, what criteria needs to be met in terms of level of conservation? Looking this up as well, just surprised that so many of these have nothing associated though when I just scan these in UCSC for many I see at least some conservation - though maybe thats just a bias on my end.

Second question, if I invoke the --skip-unmapped flag during mapping wouldn't this preclude me from thereafter collapsing into contiguous intervals?

Thanks for your help!

ADD REPLY
2
Entering edit mode

I'd take a very careful look at the distribution of your observed phyloP scores before imputing zeros in gaps. If there's a gap where conservation could not be determined over the species alignment, then putting in a zero could say this position is neutral when it simply isn't known. That could be problematic.

That caveat given, here is an ugly way you could use bedmap to deal with that case:

for fn in `ls mm10_phyloP_bed/*.bed`; do \
    paste roi.perBase.bed <(bedmap --echo-map-score roi.perBase.bed ${fn##*/} | awk '{ if (length($0) == 0) { print "0" } else { print $0 } }') \
done

If you're doing column aggregation for elements of equal length, this should give you a score of some kind at every per-base position over your set of elements.

An unreleased version of bedmap includes the option to report a zero score where there are unmapped reference elements, but it may be a little longer before the next version of BEDOPS comes out.

If you are more comfortable leaving out zeros, you might simply aggregate over the non-NaNs. Say you have two vectors (2, 10, NAN) and (3, 2, 1). You would take the average of the first two elements, and leave the third element as it is. Not great, but if you have a bunch of such vectors, you could think about calculating a per-base or per-position standard deviation or variance specific to the n at that position, and return this along with the mean or other per-base statistic.

ADD REPLY
1
Entering edit mode

In the case of missing data, you might impute a value of zero for neutral conservation or selection. But this really depends on the data, and I don't remember exactly how this works for phyloP. To do imputation you'd need an extra step. I'm on my phone so I'd have to wait until I'm in front of a proper keyboard to describe the specifics.

ADD REPLY

Login before adding your answer.

Traffic: 1610 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6