Our application note on Spidermonkey has been published online at Bioinformatics Advance Access.
This method was introduced in an evolutionary context in our paper
on the evolution of the phenotypically important and highly variable V3 loop of the envelope glycoprotein.
Spidermonkey is an analytical pipeline (a series of data-processing algorithms) for identifying co-evolving sites in an alignment of protein-coding sequences.
We provide an easy-to-use web interface to upload and screen alignments, infer a tree in cases when none is available, reconstruct the evolutionary
history of sequences, and analyze coevolutionary patterns in that history using Bayesian graphical models.
The first three stages of the Spidermonkey analysis of protein-coding nucleotide sequences (e.g. codon sequences) in fact use the same pipeline as SLAC,
but for your convenience the corresponding help text from the SLAC help page
is provided below, plus additional material specific to Spidermonkey. We also enable you to detect co-evolving sites in non-coding nucleotide and amino acid sequences using methods similar to the SLAC pipeline (fitting the respective substitution model by maximum likelihood and reconstructing ancestral sequences).
Uploading your data
Because Spidermonkey utilizes the same data input methods as the rest of the analyses on Datamonkey, it accepts alignments in FASTA, PHYLIP, NEXUS, and MEGA formats.
For further details, please refer to the help page on preparing your data.
As usual, you will be given the option of using your own tree if it is included in the file along with the sequences;
otherwise, you can choose to infer a tree de novo using the neighbor-joining method implemented in Datamonkey.
Additionally, you can decide to use a specific model of nucleotide substitution (e.g. HKY85) or use the automated
model selection algorithm in Datamonkey (based on likelihood).
[IMPORTANT] If your alignment contains more than one partition, you will not be allowed to choose Spidermonkey as an analytical option
after the data upload and tree option pages.
This is because Spidermonkey compares information about how non-synonymous substitutions
are distributed across tree branches on a site-by-site basis. If two sites have different trees
associated with them, it is generally impossible to match up inferred substitutions on the affected branches in
a meaningful way.
Your alignment may be partitioned if you have previously analyzed it for recombination
using GARD or other methods in HyPhy. Spidermonkey assumes that there has been no recombination in the alignment,
so if you need to screen for co-evolving sites in an alignment where recombination is present, extract each partition and upload it separately
(running Spidermonkey on each partition in turn). Note that in some cases it isn't strictly necessary to run separate partitions,
when phylogenetic incongruence is due to variation in rates and not recombination. To determine whether or not this is the case,
inspect the 'best split' or raw output from GARD and compare the topologies of the split trees.
Phase 1: Nucleotide/protein model maximum likelihood (ML) fit
For codon or nucleotide sequences, a nucleotide model (any model from the time-reversible class can be chosen) is fitted to the data and tree (either NJ or user supplied)
using maximum likelihood to obtain branch lengths and substitution rates. The "best-fitting" nucleotide model can be determined automatically by
a model selection procedure or chosen by the user. An analysis of non-coding nucleotide sequences jumps directly to Phase 3.
For amino acid sequences, one of several empirically derived amino acid substitution rate matrices (e.g. PAM, HIV within; Nickle et al. 2007) is fit using maximum likelihood as above. We also provide an AIC-based model selection procedure to discriminate between several canonical models of protein evolution on the basis of the data. An analysis of amino acid sequences likewise jumps directly to Phase 3.
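As a rough sketch of how an AIC-based choice between candidate models works, the following Python fragment scores hypothetical fits. The model names are real, but the log-likelihoods and parameter counts are invented for illustration; in practice they would come from the maximum-likelihood fits described above.

```python
def aic(log_lik, n_params):
    # Akaike information criterion: 2k - 2 ln L (lower is better)
    return 2 * n_params - 2 * log_lik

# Hypothetical maximized log-likelihoods and free-parameter counts
candidates = {
    "JC69":  (-5123.4, 0),
    "HKY85": (-5010.2, 4),
    "GTR":   (-5005.9, 8),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # here the extra GTR parameters are worth their AIC penalty
```

The penalty term is what keeps the richest model from winning automatically: a model with more parameters must improve the log-likelihood by more than one unit per parameter to be preferred.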
Phase 2: Codon model ML fit
Holding branch lengths and substitution rate parameters constant at the values estimated in Phase 1,
a codon model obtained by crossing MG94 and the nucleotide model of Phase 1 is fitted to the data to obtain a global ω=dN/dS ratio.
Optionally, a confidence interval on dN/dS can be obtained using profile likelihood.
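A profile-likelihood interval can be sketched as follows. The quadratic log-likelihood below is a toy stand-in for the real profiled codon-model likelihood (which would refit the model at each candidate ω value), and the 1.92 cutoff is half the 95% χ² critical value with one degree of freedom.

```python
# Toy stand-in for the profiled log-likelihood of omega = dN/dS,
# peaked at omega = 0.4 (a real analysis would refit the codon model
# at each omega value and record the maximized lnL).
def log_lik(omega):
    return -1000.0 - 50.0 * (omega - 0.4) ** 2

grid = [i / 1000 for i in range(1, 2000)]  # omega in (0, 2)
peak = max(log_lik(w) for w in grid)
# Keep omega values whose lnL is within 1.92 units of the peak
inside = [w for w in grid if peak - log_lik(w) <= 1.92]
lo, hi = min(inside), max(inside)
print(lo, hi)  # endpoints of the approximate 95% CI
```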
Phase 3: Reconstructing ancestors and counting substitutions
Utilizing parameter estimates from Phases 1 and/or 2, ancestral sequences are reconstructed site by site using maximum likelihood, in
such a way as to maximize the likelihood of the data at the site over all possible ancestral character states. Inferred ancestral sequences are
treated as known for Phase 4.
It is also possible to resample ancestral sequences, weighting random outcomes in proportion to their posterior probability given your alignment. Resampling ancestral sequences provides a means of quantifying our confidence
in identifying co-evolving sites given our uncertainty in inferring ancestral sequences.
In the case of codon sequences, only nonsynonymous substitutions are mapped to branches of the tree, i.e., a substitution is counted only when the codons at either end of a branch encode different amino acids
(details). Codon sites at which no nonsynonymous substitutions
have been inferred are automatically discarded from further analysis. Similarly, all invariant sites in nucleotide or protein sequences are discarded.
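The branch-by-branch classification of substitutions can be illustrated with a toy translation table. The table below covers only the codons used in the example; the real pipeline uses the full genetic code and the reconstructed ancestral codons.

```python
# Toy codon-to-amino-acid table (only the codons used below;
# the real analysis uses the full genetic code)
CODON_TABLE = {
    "TTT": "F", "TTC": "F",  # phenylalanine
    "CTT": "L",              # leucine
    "GCA": "A", "GCG": "A",  # alanine
}

def is_nonsynonymous(ancestral, descendant):
    # A substitution is nonsynonymous when the two codons
    # encode different amino acids
    return CODON_TABLE[ancestral] != CODON_TABLE[descendant]

print(is_nonsynonymous("TTT", "TTC"))  # False: both encode F (synonymous)
print(is_nonsynonymous("TTT", "CTT"))  # True: F -> L (nonsynonymous)
```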
Intermission: Everything you ever wanted to know about Bayesian graphical models but were too socially well-adjusted to ask
A Bayesian graphical model (BGM) is a compact representation of the joint probability distribution of many random variables.
In mathematics and computer science, a graph (or network) is a visual depiction of the relationship between two or more individuals,
in which each individual is represented by a node (i.e. a point or vertex). Relationships between nodes are indicated
by drawing a line connecting the nodes, which is referred to as an edge. In this context, edges represent significant
statistical associations between individual codon sites of an alignment, which could be caused by functional or structural
interactions between the corresponding residues of the protein.
Edges may be directed (arrows) to indicate that one node is conditionally dependent on another;
otherwise, they are undirected and we make no distinction between whether A is dependent on B or vice versa.
For example, one would be inclined to draw an arrow starting from a node for whether it is raining,
and terminating at a node for whether the lawn is wet. (It's more intuitive to think about this in terms of the state
of node A causing the state of node B, but with this we are treading near the morass of inferring causality from data;
while many believe that this is possible, it remains an open philosophical problem and hence very bad juju for us simple-minded biologists.)
The power of BGMs lies in exploiting the Markov property (a.k.a. conditional independence) to isolate each node from the rest of the graph
once its dependence on its neighbors is accounted for.
For example, suppose I wanted to figure out the probability that my genome contains a particular genetic marker.
Naively, I would have to assume that this probability was roughly equal to the expected frequency of that marker in the population.
(Of course, we could be more sophisticated and stratify the population, using the frequency for my particular ethnic group, etc.)
Now, if I learn that my paternal grandfather had X copies of the marker, then I could condition my probability (at least in part)
on this observation. Moreover, if I subsequently learn my FATHER's genotype (this is becoming a well-characterized family!),
then (all else being equal) it no longer matters what my grandfather's genotype was.
If you still don't understand what we're talking about here, you're probably thinking about it too much. :-)
Another advantage of using BGMs is that we're looking at all the variables, all at the same time.
This is a big improvement over applying, say, a pairwise association test statistic (e.g. chi-squared contingency tables)
to every pair of variables and throwing all the significant pairs into a big bucket.
Oftentimes, this results in a big mess as far as trying to understand the big picture.
It's like trying to read a street map that's been through an industrial blender.
To make matters worse, pairwise comparisons can pick up a large number of extraneous associations.
In the family example above, my genotype depends on my grandfather's genotype only insofar it influences
my father's genotype. Pairwise comparisons may yield THREE significant associations (me-dad, me-grandpa, dad-grandpa)
where there are actually only two (me-dad, dad-grandpa).
Being able to ignore the rest of the graph when we're computing the (posterior) probability of each node makes life a lot easier.
After we're done, we can use the chain rule of probability to compute the joint posterior probability of the whole graph (nodes, edges, and all).
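The family example and the chain rule can be put into a few lines of Python; all of the probabilities below are invented for illustration.

```python
# Invented probabilities for a three-node chain: grandpa -> dad -> me.
# States: 0 = no copy of the marker, 1 = at least one copy.
p_grandpa = {0: 0.7, 1: 0.3}
p_dad_given_grandpa = {0: {0: 0.9, 1: 0.1},
                       1: {0: 0.5, 1: 0.5}}
p_me_given_dad = {0: {0: 0.9, 1: 0.1},
                  1: {0: 0.5, 1: 0.5}}

# Chain rule: the joint factorizes into local terms,
# one per node given its parent.
def joint(g, d, m):
    return p_grandpa[g] * p_dad_given_grandpa[g][d] * p_me_given_dad[d][m]

# Markov property: once dad's genotype is known,
# grandpa's genotype is irrelevant to mine.
def p_me_given(d, g):
    total = joint(g, d, 0) + joint(g, d, 1)
    return joint(g, d, 1) / total

print(p_me_given(d=1, g=0), p_me_given(d=1, g=1))  # identical: 0.5 and 0.5
```

Conditioning on dad's genotype gives the same answer whatever grandpa's genotype was, which is exactly the conditional independence that lets a BGM ignore everything beyond a node's neighbors.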
Every graph is a hypothesis of how variables work together. Our goal is to find the most likely graph given our data.
However, this isn't always a feasible task. The number of possible graphs (especially if we're working with directed edges) grows
super-exponentially with the number of nodes (for example, there are over a trillion different undirected graphs on just 10 nodes).
Being biologists, we won't usually have enough data to be able to choose just one graph out of thousands of millions as being the most likely graph.
This is where
order-MCMC comes in.
Instead of trying to pick just one graph, we're going to average over subsets of graphs,
which are going to be defined by permutations in a hierarchical ranking of the nodes in the graph (node orders).
A permutation is one of many ways of ordering a set of distinct objects into a linear sequence.
For example, the Three Stooges can walk into a door-frame in six permutations: Larry, Curly, Moe; Larry, Moe, Curly; and so on.
With a node order, we are making an assertion about which nodes can depend on (be the 'child' of) other nodes.
The trick is then to wander around the permutation space of node orders with a Markov chain Monte Carlo (MCMC) sampler,
comparing the posterior probability of each subset of graphs defined by the current hierarchy.
The nice thing about this approach is that it's more efficient (there are way fewer permutations
for N nodes than there are undirected graphs: N! versus 2^(N*(N-1)/2), to be exact), and it characterizes what the
'correct' graph looks like when we have limited data.
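The counting argument is easy to check directly:

```python
import math

def n_orders(n):
    # number of node orders (permutations) on n nodes
    return math.factorial(n)

def n_undirected_graphs(n):
    # one optional edge per pair of nodes
    return 2 ** (n * (n - 1) // 2)

n = 10
print(n_orders(n))             # 3628800
print(n_undirected_graphs(n))  # 35184372088832, i.e. over a trillion
```

At 10 nodes the order space is already about ten million times smaller than the graph space, and the gap only widens as N grows.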
Further reading in Bayesian graphical models:
The standard text is Pearl (1988) [google books]. However, I personally find it a little esoteric and not well suited for biologists with mathematical aspirations. I find the following papers to provide excellent background with lucid prose:
GF Cooper and E Herskovits (1992) A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9(4): 309-347.
N Friedman and D Koller (2003) Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks.
Machine Learning 50(1-2): 95-125. [journal link]