Tuesday, January 21, 2014

phylorare: an R function for calculating the rarefied Phylogenetic Diversity of ecological samples


Here's a function I have written for the R statistical environment that calculates expected phylogenetic diversity (under rarefaction) of multiple samples. The function uses an exact analytical formula (Nipperess & Matsen 2013). I am providing it for free and without warranty under the GNU General Public License. You need to be familiar with R to use this function. The function also requires that the ape package be installed. To load the function, place the file in your working folder and type ‘source(“phylorare.R”)’.

UPDATE (21st Jan 2014): When using sampling without replacement (classic rarefaction), older versions of this function would not work with datasets that have a large number of objects (individuals/sites/species) to be rarefied. This is because phylorare must calculate the number of possible combinations of a subset of objects in the total and this can be a very large number. It is possible, therefore, to exceed the largest number that R can handle. If that occurred, no warning would be issued but the output would include NAs instead of expected values of Phylogenetic Diversity. I have fixed this problem such that phylocurve should now be able to handle very large numbers (I had to re-learn some high school maths but I got there eventually). It might still be possible to exceed the limits of R but you should have a lot more headroom. If you do still encounter this problem, your options at this point are to: 1) sample with replacement (replace=T); 2) try using a more powerful computer (although this is unlikely to make much difference); or 3) use a much slower algorithmic solution (such as phylocurve.perm).

Usage
phylorare (x, phy, m=1, subsampling = “individual”, replace =F)
Arguments
x is a community data table (as in the vegan package) with species/OTUs as columns and samples/sites as rows. Columns are labelled with the names of the species/OTUs. Rows are labelled with the names of the samples/sites. Data can be either abundance or incidence (0/1).
phy is a rooted phylogenetic tree with branch lengths stored as a phylo object (as in the ape package) with terminal nodes labelled with names matching those of the community data table. Note that the function trims away any terminal taxa not present in the community data table, so it is not necessary to do this beforehand.
m is the number of individuals, sites or species to which x is to be rarefied.
subsampling indicates whether the subsampling will be by “individual” (default), “site” or “species”. When there are multiple sites, rarefaction by individuals or species is done separately for each site.
replace is a boolean indicating whether subsampling should be done with or without replacement.

Details
phylorare takes a community data table and a rooted phylogenetic tree (with branch lengths) and calculates expected Phylogenetic Diversity (PD) for a given sample size of m individuals, sites or species. Calculations are done using an exact analytical formula generalised from the classic equation of Hurlbert (1971). When there are multiple sites in the community table and rarefaction is either individual or species-based, rarefied PD values are calculated for each site separately. In this case, all PD values will include the root of phy even if all the species in a site share a more recent common ancestor.
Value
phylorare returns a vector object giving the expected PD values of all sample/sites in x.
References
Hurlbert. 1971. The nonconcept of Species Diversity: a critique and alternative parameters. Ecology 52: 577-586.
Nipperess & Matsen. 2013. The mean and variance of phylogenetic diversity under rarefaction. Methods in Ecology & Evolution 4: 566-572Manuscript also available from ArXiv.org