| term | definition |
Alphabet
| the possible Character States for a given column in a Character State Matrix (e.g. single-letter amino/nucleic acid codes, allele labels, numeric quantiles). In an evolutionary context, it must be possible (or at least meaningful) to mutate between some Character States defined in the Alphabet; thus, numeric measurements that can take on infinitely many values (e.g. IC50s) are not immediately interpretable as Character States, and must first be transformed into discrete states (e.g. binned into quantiles). |
Ambiguous States
| like Probabilistic States, but without known (or applicable) probabilities; thus, an Ambiguous State may be represented by a new Symbol from a Taxonomy that relates combinations of real Character States to their Ambiguous State counterpart. |
Bipartition
| a separation of all OTUs into two sets; a Node in a Tree represents a Bipartition between the Clade it defines and all other OTUs. Bipartitions are used when comparing tree topologies to identify common and divergent topological features, and are encountered in many phylogenetic analyses (e.g. MrBayes, BAliPhy, consensus, TREE-PUZZLE). |
CDAT
| an acronym for Character Data And Trees; one CDAT represents the composite of multiple Taxa, an (optional) associated Character State Matrix and an (optional) Tree, as well as any accompanying annotation. Such a composite should be useful for 1) coordinated, "bound" data exchange between "consumers" and "producers" of underlying primary datatypes (including input/output from Nexus/phyloXML flatfiles or relational database storage), 2) maintenance of relational integrity between the composited data, and 3) facile co-manipulation of underlying primary data types. |
Canonical Form
| a Character State is canonically represented by a particular Symbol from an Alphabet, though other Symbols may also refer to the same Character State (e.g. lower- vs. upper-case variants of the same letter). Such distinctions must be kept to allow "round trip" data IO, while the CDAT API should be sensitive to (non)use of canonical form. |
Character State Matrix
| a two-dimensional matrix where 1) every row represents (or has the potential to represent) a Node in a corresponding Tree, and 2) every column represents a particular molecular, morphological, or otherwise generic "State" of interest (e.g. a residue of primary sequence, the presence/absence of a genomic feature, an allele of a haplotype).
A row in a Matrix may correspond to either the observed States of an OTU, or the inferred ancestral States of some other Node in a Tree; because ancestral States are conditional on the Tree, a 1-to-1 relationship between Matrix (or the subset of ancestral rows) and Tree is required.
Matrix columns may be considered as ordered or unordered (in the latter case, they are simply a Character State Collection, though any physical representation of these states will artificially impose some order on them). The actual ordering of states is important to only a subset of phylogenetic analyses (e.g. BAliPhy's pairHMM alignment model); for all others, the distinction is only a visualization/IO requirement.
|
Clade
| a (named) grouping of OTUs. Every Node in a Tree defines a Clade, comprising all descendants of the Node in question (including the Node itself). Other Clades may be externally definable, and may or may not represent a Clade observed in a Tree (i.e. directly correspond to a single Node in a Tree). |
Column Label
| an (optional) alphanumeric string that uniquely names/describes the content of a particular column |
Column/Character Set
| particular subsets of columns may be further demarcated or annotated (e.g. every third column in a Matrix representing a CDS alignment may be annotated as a Character Set, or CharSet, representing all codon 3rd position "wobble" bases). |
Coordinate System
| for Ordered Character State Matrices, it may be more convenient to annotate each OTU with a Coordinate System (e.g. Bio::RangeI) from which the Location of a given state from the OTU may be calculable (rather than annotate every consecutive State with an independent Location). |
Derived Matrix
| a "secondary" Character State Matrix, whose states are immediately derivable from a primary Character State Matrix (the canonical example is the CDS-to-protein translation, dependent upon a codon table annotation of the primary CDS matrix). |
Distance Matrix
| a matrix (not a Character State Matrix) that defines the pairwise distances between OTUs, used in distance-based analyses such as Neighbor Joining. |
Edge
| the relationship between two Nodes in a Tree, usually denoting some evolutionary process occurring over time |
Leaves
| terminal Nodes in a tree that correspond to OTUs. |
Location
| many molecular Character States will stem from longer biosequences in which the State is located; a Location is a particular form of annotation that may be ascribed to a Character State (i.e. a Character State should be Locatable). Note that the concept of State Location applies to both Ordered and Unordered Matrices. |
Node
| an observable (though possibly ancestral) entity found in a Tree |
Probabilistic States
| given the desire to capture evolutionary inferences such as ancestral State reconstructions, it is (sometimes) useful to conceptualize each cell in a Character State Matrix as a discrete probability distribution over an Alphabet of possible Character States; even "observed" OTU States may be similarly conceptualized (since molecular data is never directly observed, only inferred via experimentation, with some error rate). |
Root
| a Node representing the last, universal common ancestor of all OTUs in a Tree; Trees may or may not have such a node (i.e. may be rooted or unrooted). Also, for "network" Trees, this concept may not even apply. |
Step Matrix
| a matrix (not a Character State Matrix) that defines the relative "mutability" of Character States into other Character States, used in parsimony analyses; representable as a Taxonomy over an Alphabet. |
Taxa
| operational taxonomic units (OTUs), the entities from which Character State(s) are "observed" and taken as ground truths. NB: the use of the terms "taxa" and "taxon" herein does not refer to organismal species, genus or any other Taxonomical unit. |
Taxonomy
| a hierarchical lexicon that can be used for structured annotation of OTUs and Nodes; the NCBI "species lineage" Taxonomy and the Gene Ontology are two commonly used lexicons. A Taxonomy (or subset thereof) could also be used as a structured Alphabet (e.g. a Step Matrix). |
Tree
| a composite datatype that consists of a network topology (most often a bifurcating, directed acyclic graph, but not limited to such), and properties associated with each Node and/or Edge (e.g. branch lengths, bootstrap or posterior support, taxonomic labelling -- referred to as the "annotation"). |
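
To make the relationship between Node, Clade and Bipartition concrete, here is a minimal Python sketch. It is illustrative only and is not part of any CDAT API: the `Node` class and the `clade` and `bipartitions` functions are hypothetical names introduced for this example, and the sketch assumes a rooted Tree whose Leaves carry unique OTU labels. It treats a Clade as the set of OTUs found at or below a Node, and a Bipartition as the unordered pair of that Clade and all remaining OTUs.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional, Set


@dataclass
class Node:
    """Hypothetical Node type; Leaves carry the label of the OTU they represent."""
    label: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def clade(node: Node) -> FrozenSet[str]:
    """Return the OTU labels of all Leaves at or below the given Node."""
    if node.is_leaf():
        return frozenset([node.label])
    otus: Set[str] = set()
    for child in node.children:
        otus |= clade(child)
    return frozenset(otus)


def bipartitions(root: Node) -> Set[FrozenSet[FrozenSet[str]]]:
    """Collect the non-trivial Bipartitions implied by the Nodes of a Tree.

    Each Bipartition is stored as an unordered pair {Clade, all other OTUs}.
    Splits with fewer than two OTUs on either side are skipped, because
    every Tree over the same OTUs contains them.
    """
    all_otus = clade(root)
    found: Set[FrozenSet[FrozenSet[str]]] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        inside = clade(node)
        outside = all_otus - inside
        if len(inside) >= 2 and len(outside) >= 2:
            found.add(frozenset([inside, outside]))
    return found


def leaf(name: str) -> Node:
    """Convenience constructor for a Leaf (an OTU)."""
    return Node(label=name)


# Two example topologies over the same five OTUs:
# tree1 = ((A,B),(C,(D,E)))   tree2 = ((A,C),(B,(D,E)))
tree1 = Node(children=[
    Node(children=[leaf("A"), leaf("B")]),
    Node(children=[leaf("C"), Node(children=[leaf("D"), leaf("E")])]),
])
tree2 = Node(children=[
    Node(children=[leaf("A"), leaf("C")]),
    Node(children=[leaf("B"), Node(children=[leaf("D"), leaf("E")])]),
])

# Bipartitions present in both Trees are their common topological features.
shared = bipartitions(tree1) & bipartitions(tree2)
for split in shared:
    print(sorted(sorted(side) for side in split))
```

Storing each Bipartition as an unordered pair of sets means that the two children of the Root, which describe the same split from opposite sides, are counted once. A real implementation would likely cache the Clade of each Node rather than recompute it on every visit.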