Thursday, July 5, 2012

phylocurve.perm: an R function for generating a rarefaction curve of Phylogenetic Diversity by randomisation


Here's a function I have written for the R statistical environment that calculates a rarefaction curve giving expected Phylogenetic Diversity (PD) for multiple values of sampling effort. Sampling effort can be defined in terms of the number of individuals, sites or species. This function is a variant of “phylocurve” that uses randomisation rather than an exact analytical formula to calculate an estimate of expected PD.  In addition, this version allows the calculation of estimates for the standard deviation, 95% confidence limits and minimum and maximum values of PD. The accuracy of these estimates is of course dependent on the number of permutations. I am providing it for free and without warranty under the GNU General Public License. You need to be familiar with R to use this function. The function also requires that the ape package be installed. To load the function, place the file in your working folder and type ‘source(“phylocurve_perm.R”)’.

WARNING: this function is much slower (about 10x to 100x) than “phylocurve”. For large datasets, I recommend testing with relatively large values of stepm and small values of perm to gauge the amount of computer time required.

Latest version: 17th August 2011. This version fixes a bug where rarefaction values were calculated incorrectly with community data tables containing species with total abundances of 0.

Usage
phylocurve.perm (x, phy, stepm=1, subsampling = “individual”, replacement = F, perm=1000)
Arguments
x is a community data table (as in the vegan package) with species/OTUs as columns and samples/sites as rows. Columns are labelled with the names of the species/OTUs. Rows are labelled with the names of the samples/sites. Data can be either abundance or incidence (0/1).
phy is a rooted phylogenetic tree with branch lengths stored as a phylo object (as in the ape package) with terminal nodes labelled with names matching those of the community data table. Note that the function trims away any terminal taxa not present in the community data table, so it is not necessary to do this beforehand.
stepm is the size of the interval in a sequence of numbers of individuals, sites or species to which x is to be rarefied.
subsampling indicates whether the subsampling will be by “individual” (default), “site” or “species”. When there are multiple sites, rarefaction by individuals or species is done by first pooling the sites.
replacement is a logical indicating whether subsampling should be done with or without replacement.
perm is the number of iterations of the subsampling routine used to calculate expected PD.

Details
phylocurve.perm takes a community data table and a rooted phylogenetic tree (with branch lengths) and calculates expected Phylogenetic Diversity (PD) for every specified value of m individuals, sites or species. m will range from 1 to the total number of individuals/sites/species in increments given by stepm. Estimates of expected PD as well as variance about that mean are determined by monte carlo randomisations (see Gotelli and Colwell 2001 for a more detailed explanation of the procedure as applied to species richness). When there are multiple sites in the community table and rarefaction is by individuals or species, sites are first pooled.
Value
phylocurve.perm returns a matrix object of seven columns with each row corresponding to a particular value of m. The columns are: 1) the values of m; 2) the mean (expected) values of PD for all randomisations; 3) the standard deviation of PD for all randomisations; 4) the 0.025 quantile (lower 95% confidence limit); 5) the 0.975 quantile (upper 95% confidence limit); 6) the minimum PD; and 7) the maximum PD.
References
Gotelli and Colwell. 2001. Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness. Ecology Letters 4: 379–391.

pd.resemble: an R function for calculating the pairwise resemblance in phylogenetic diversity of ecological samples


Here's a function I have written for the R statistical environment that calculates the pairwise resemblance (ie. either similarity or dissimilarity)  in Phylogenetic Diversity of multiple samples. I am providing it for free and without warranty under the GNU General Public License. You need to be familiar with R to use this function. The function also requires that the ape package be installed. To load the function, place the file in your working folder and type ‘source(“pdresemble.R”)’.

Latest version: 7th March 2011.

Please note that this function was previously called “phylosim”. However, an R package now has that name, so I changed the name of my function. This version of the function also allows for the calculation of either similarity or dissimilarity while previous versions only calculated similarity, although conversion between the two is trivial.

Usage
pd.resemble (x, phy, incidence = T, method = “sorensen”, dissim=T)
Arguments
x is a community data table (as in the vegan package) with species/OTUs as columns and samples/sites as rows. Columns are labelled with the names of the species/OTUs. Rows are labelled with the names of the samples/sites. Data can be either abundance or incidence (0/1).
phy is a rooted phylogenetic tree with branch lengths stored as a phylo object (as in the ape package) with terminal nodes labelled with names matching those of the community data table. Note that the function trims away any terminal taxa not present in the community data table, so it is not necessary to do this beforehand.
incidence is a logical indicating whether the data are to be treated as incidence (binary presence-absence) or abundance.
method indicates the particular form of the resemblance index you wish to use. Current options are: "sorensen" (default - 2a/a+b+c), "jaccard" (a/a+b+c), "simpson" (a/a+min{b,c}) and "faith" (a+0.5d/a+b+c+d).
dissim is a logical indicating whether the pairwise resemblance scores should be similarity or dissimilarity (default).
Details
pd.resemble takes a community data table and a rooted phylogenetic tree (with branch lengths) and calculates the resemblance in Phylogenetic Diversity (PD-resemblance) of all pairwise combinations of samples/sites. The principles for calculating PD-resemblance on incidence data are discussed by Ferrier et al. (2007). I have extended this approach to include abundance data (Nipperess et al. 2010).
Value
pd.resemble returns a dist object giving the PD-resemblance of all pairwise combinations of sample/sites in x.
References
Ferrier S, Manion G, Elith J & Richardson K. 2007. Using generalized dissimilarity modelling to analyse and predict patterns of beta diversity in regional biodiversity assessment. Diversity & Distributions 13: 252-264.
Nipperess DA, Faith DP & Barton K. 2010. Resemblance in phylogenetic diversity among ecological assemblages. Journal of Vegetation Science 21: 809-820.

phylo.endemism: an R function for calculating phylogenetic endemism of ecological samples


Here's a function I have written for the R statistical environment that calculates phylogenetic endemism of multiple samples. I am providing it for free and without warranty under the GNU General Public License. You need to be familiar with R to use this function. The function also requires that the ape package be installed. To load the function, place the file in your working folder and type ‘source(“phyloendemism.R”)’.

Latest version: 2nd December 2010.


Usage
phylo.endemism (x, phy, weighted = T)
Arguments
x is a community data table (as in the vegan package) with species/OTUs as columns and samples/sites as rows. Columns are labelled with the names of the species/OTUs. Rows are labelled with the names of the samples/sites. Data can be either abundance or incidence (0/1).
phy is a rooted phylogenetic tree with branch lengths stored as a phylo object (as in the ape package) with terminal nodes labelled with names matching those of the community data table. Note that the function trims away any terminal taxa not present in the community data table, so it is not necessary to do this beforehand.
weighted is a logical indicating whether weighted endemism (default) or strict endemism should be calculated.
Details
phylo.endemism takes a community data table and a rooted phylogenetic tree (with branch lengths) and calculates either strict or weighted endemism in Phylogenetic Diversity (PD). Strict endemism equates to the total amount of branch length found only in the sample/site and is described by Faith et al. (2004) as PD-endemism. Weighted endemism calculates the "spatial uniqueness" of each branch in the tree by taking the inverse of its range, multiplying by branch length and summing for all branch lengths present at a sample/site. Range is calculated simply as the total number of samples/sites at which the branch is present. This latter approach is described by Rosauer et al. (2009) as Phylogenetic endemism.
Value
phylo.endemism returns a vector object giving the phylogenetic endemism of all sample/sites in x.
References
Faith DP, Reid CAM & Hunter J. 2004. Integrating phylogenetic diversity, complementarity, and endemism for conservation assessment. Conservation Biology 18(1): 255-261.
Rosauer D, Laffan SW, Crisp MD, Donnellan SC & Cook LG. 2009. Phylogenetic endemism: a new approach for identifying geographical concentrations of evolutionary history. Molecular Ecology 18(19): 4061-4072

addbinary: an R function for converting an abundance table (x) into its additive binary form


Here's a function I have written for the R statistical environment that converts a table of species abundances into its additive binary form. That is, each column of abundance data (representing a single species across multiple sites) is converted into a number of columns equal to the maximum abundance of that species, with each new column representing an abundance value ranging from 1 to maximum abundance. Species abundance per site is recorded as a "1" in each column for which the abundance value is equal to or less than the site abundance.

If that makes no sense, let me attempt a simple example below with an abundance vector for a single species:

0 --> 000
3 --> 111
1 --> 100
0 --> 000
2 --> 110

The purpose for doing this, at least from my point of view, is to allow for the calculation of abundance-based forms of resemblance (similarity/dissimilarity) measures with R functions that only calculate the incidence-based, or binary, form. Tamas et al. (2001) have shown that any incidence-based resemblance measure calculated from the 2 x 2 contingency table can be transformed into its abundance-based equivalent by this method. So, for example, functions available in the simba and GDM packages can be "fooled" into accepting abundance-weighted data for calculating resemblance measures. Hopefully, this will be useful to somebody other than me.

I am providing this function for free and without warranty under the GNU General Public License. You need to be familiar with R to use this function.

Usage
addbinary (x)
Arguments
x is a community data table (as in the vegan package) with species/OTUs as columns and samples/sites as rows. Columns are labelled with the names of the species/OTUs. Rows are labelled with the names of the samples/sites. Data are abundances of species/OTUs per sample/site.
Details
addbinary converts a community data table to its additive binary form (see discussion above).
Value
addbinary returns a matrix object being the additive binary form of x.
References
Tamas et al. (2001) An extension of presence/absence coefficients to abundance data: a new look at absence. Journal of Vegetation Science 12: 401-410.