HISTORY Block
Purpose, relationship to NEXUS standard and to other blocks, intended uses
-
The HISTORY block is designed to allow representation of ancestral states. Such data arise in two sorts of contexts: simulation of evolution, and statistical inference of ancestral
states. In either case, because ancestors are defined by a tree, the representation
requires a tree.
-
To encompass all relevant
data, and to simplify the representation of simulated data, the HISTORY block
specifies a tree as well as the states at all nodes, not just ancestral
nodes, even when the terminal nodes carry observed states that are represented
elsewhere in the same file, e.g., in a CHARACTERS block.
-
The HISTORY block uses a single phylogenetic tree with names applied to internal nodes
(as allowed by the current Newick standard), as well as terminal nodes,
along with a matrix of states for all nodes. The matrix command
(as in the original NEXUS standard for a CHARACTERS block MATRIX) allows plural or
indeterminate states, so as to represent states inferred by a probabilistic model, which
take the natural form of a probability distribution (e.g., 0.12 G, 0.23 A, 0.56 C, 0.09 T).
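
Because every node in the tree needs a row of states, the pairing of tree and matrix can be checked mechanically. The following Python sketch is illustrative only and is not part of the block definition. It assumes unquoted Newick labels; the function names, the sample tree, and the in-memory layout (a dictionary keyed by node name) are hypothetical.

```python
import re

def _strip_branch_lengths(newick):
    # Drop branch lengths such as ":0.12" so they are not mistaken for labels.
    return re.sub(r":[^,()\s;]+", "", newick)

def newick_labels(newick):
    """Return every node label (terminal and internal) in a Newick string."""
    # What remains between parentheses, commas, and semicolons is a label.
    return re.findall(r"[^(),;\s]+", _strip_branch_lengths(newick))

def unnamed_internal_nodes(newick):
    """Count closing parentheses that are not followed by a node name."""
    return len(re.findall(r"\)(?=\s*(?:[,);]|$))", _strip_branch_lengths(newick)))

def unmatched_names(newick, matrix_rows):
    """Return (tree labels with no matrix row, matrix rows with no tree label)."""
    labels = set(newick_labels(newick))
    rows = set(matrix_rows)
    return sorted(labels - rows), sorted(rows - labels)

# Hypothetical one-character history: observed states at the tips,
# probability distributions at the two named internal nodes.
tree = "((A:0.1,B:0.2)anc1:0.05,C:0.3)root;"
states = {
    "A": "G",
    "B": "G",
    "C": "A",
    "anc1": {"G": 0.12, "A": 0.23, "C": 0.56, "T": 0.09},
    "root": {"G": 0.50, "A": 0.50, "C": 0.00, "T": 0.00},
}

print(newick_labels(tree))
print(unnamed_internal_nodes(tree))
print(unmatched_names(tree, states))
```

Quoted labels and bracketed comments would need a real Newick parser; the regular expressions here only cover the simple case.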
-
The HISTORY block is intended as a stand-alone public block, consistent with the
existing NEXUS standard. It is essentially a specialized CHARACTERS block that
borrows the TREE command from the TREES block. If the user wishes to make an explicit
connection to a CHARACTERS block (or to a TREES block) in the case that the HISTORY
block represents inferences made from the same set of observed states,
then a LINK command is appropriate.
-
Ultimately we wish to use the HISTORY block to communicate between the following:
- software used to generate histories, by simulation or by inference (e.g., PAML)
- database software used to store histories
- software used to analyze histories
- software used to visualize histories
Block structure and syntax
Syntax of the HISTORY block is as follows (following the published format
standard of Maddison, Swofford & Maddison, 1997).
BEGIN HISTORY;
DIMENSIONS
NTAX=number-of-taxa
NCHAR=number-of-characters
;
[FORMAT [DATATYPE={STANDARD|DNA|RNA|NUCLEOTIDE|PROTEIN}] [MISSING=symbol] [GAP=symbol] [SYMBOLS="symbol [symbol...]"]
;]
TAXLABELS taxon-name [taxon-name...];
MATRIX data-matrix;
TREE [*] tree-specification;
END;
DIMENSIONS, FORMAT, and TAXLABELS must precede MATRIX and TREE. Only one of each of these commands is allowed per block.
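
The ordering rule can be illustrated with a small Python sketch. It is not a NEXUS parser: it assumes the block contains no bracketed comments and no quoted tokens with semicolons. The sample block is hypothetical. In particular, it assumes that NTAX counts internal nodes as well as terminal ones, and that the tree specification takes the "name = Newick" form used in the TREES block; neither detail is stated in the syntax above.

```python
EXAMPLE = """
BEGIN HISTORY;
  DIMENSIONS NTAX=5 NCHAR=2;
  FORMAT DATATYPE=DNA MISSING=? GAP=-;
  TAXLABELS A B C anc1 root;
  MATRIX
    A     AC
    B     AG
    C     TG
    anc1  A(0.10,0.60,0.30,0.00)
    root  (0.70,0.05,0.05,0.20)G
  ;
  TREE t1 = ((A,B)anc1,C)root;
END;
"""

PRECEDING = ("DIMENSIONS", "FORMAT", "TAXLABELS")
FOLLOWING = ("MATRIX", "TREE")

def command_names(block_text):
    """Return the first word of each semicolon-terminated command."""
    names = []
    for chunk in block_text.split(";"):
        tokens = chunk.split()
        if tokens:
            names.append(tokens[0].upper())
    return names

def check_order(block_text):
    """Return a list of problems; an empty list means the rules are met."""
    names = command_names(block_text)
    if names[:1] != ["BEGIN"] or names[-1:] != ["END"]:
        return ["block must start with BEGIN and finish with END"]
    commands = names[1:-1]
    problems = []
    # Each of the five commands may appear at most once.
    for name in PRECEDING + FOLLOWING:
        if commands.count(name) > 1:
            problems.append(f"{name} appears more than once")
    # Position of the first MATRIX or TREE command, if any.
    first_late = min(
        (i for i, name in enumerate(commands) if name in FOLLOWING),
        default=len(commands),
    )
    for i, name in enumerate(commands):
        if name in PRECEDING and i > first_late:
            problems.append(f"{name} appears after MATRIX or TREE")
    return problems

print(check_order(EXAMPLE))
```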
Commands
The following commands are recognized in the HISTORY block (following the published format
standard of Maddison, Swofford & Maddison, 1997).
- DIMENSIONS. This command contains information about how many taxa are present, and how many characters are present in the sequence information for each taxon. The DIMENSIONS command and all subcommands are implemented in exactly the same way as in the CHARACTERS block. The NTAX subcommand of this command specifies how many taxa will be defined. The NCHAR subcommand specifies the number of characters in the matrix.
- FORMAT. This command specifies the formatting of the data contained in the MATRIX command. The FORMAT command is implemented in much the same way as the FORMAT command in the CHARACTERS block, except that probability data may be included. Transposed and interleaved data are not permitted. The FORMAT command recognizes the following subcommands:
- DATATYPE. This subcommand specifies the type of sequence data that is being described. If present, it must be the first subcommand in the FORMAT command. Standard data consists of any general sort of discrete character data, and this class is typically used for morphological data, restriction site data, and so on. DNA, RNA, NUCLEOTIDE, and PROTEIN designate molecular sequence data. Default is STANDARD.
- MISSING. This subcommand declares the symbol that designates missing data. The default is "?". For example, MISSING=X defines an X to represent missing data. Whitespace is illegal as a missing data symbol, as are the following symbols:
()[]{}\/,;:=*'"`<>^
- GAP. This subcommand declares the symbol that designates a data gap (e.g., base absent in DNA sequence because of deletion or an inapplicable character in morphological data). There is no default gap symbol; a gap symbol must be defined by the GAP subcommand before any gaps can be entered into the matrix. For example, GAP=- defines a hyphen to represent a gap. Whitespace is illegal as a gap symbol, as are the following symbols:
()[]{}\/,;:=*'"`<>^
- SYMBOLS. This subcommand specifies the symbols and their order for character states and vectors of probability used in the MATRIX command. For example, SYMBOLS="0 1 2 3 4 5 6 7" designates numbers 0 through 7 as acceptable symbols in a matrix. The default symbols list differs from one DATATYPE to another, as described under "state symbol" in the Appendix. Whitespace is not needed between elements: SYMBOLS="012" is equivalent to SYMBOLS="0 1 2". For STANDARD DATATYPE, a SYMBOLS subcommand will replace the default symbols list of "0 1". For DNA, RNA, NUCLEOTIDE, and PROTEIN DATATYPEs, a SYMBOLS subcommand will not replace the default symbols list but will add character-state symbols to the SYMBOLS list. The added symbols will be ordered at the end of the resulting SYMBOLS list. For example, SYMBOLS="LP" with DATATYPE DNA (which has a default symbols list of "ACGT") would produce a symbols list of "ACGTLP" with the added symbols at the end of the default list. If the data being described in the HISTORY block are probability data, the order of the symbols list will also define the order of the values given in the vectors of probabilities.
- TAXLABELS. This command specifies the names of the taxa. This command is implemented exactly as in the CHARACTERS block, except that a name must be present for each internal node in the tree, and these names must match exactly the names specified in the TREE command.
- MATRIX. This command specifies the actual data for the inferred ancestral states. This command is implemented in the same way as in the CHARACTERS block, except that vectors of probabilities may be supplied instead of discrete character data. If vectors of probabilities are used, they must be enclosed in parentheses and separated by commas. Information about the actual observed sequence data (i.e., sequence data for the leaf nodes) may be included as discrete character data. The co-mingling of probability data and character data is therefore permitted.
- TREE. The TREE command specifies the phylogenetic tree that contains the ancestral states that the data in the HISTORY block describe. This command is implemented exactly as in the TREES block, except that a name must be supplied for each ancestral node, and these names must match exactly the names supplied in the MATRIX command.
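
The interaction between the SYMBOLS list and probability vectors in the MATRIX can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are hypothetical, only the two default symbols lists quoted above (STANDARD and DNA) are included, and each parenthesized group is assumed to be a comma-separated vector of probabilities with one value per symbol.

```python
import re

# Defaults quoted in the SYMBOLS description above; other DATATYPEs are
# left out because their default lists are not given in this document.
DEFAULT_SYMBOLS = {"STANDARD": "01", "DNA": "ACGT"}

def effective_symbols(datatype="STANDARD", symbols=None):
    """Combine a DATATYPE default with an optional SYMBOLS declaration."""
    datatype = datatype.upper()
    default = DEFAULT_SYMBOLS[datatype]
    if symbols is None:
        return default
    declared = "".join(symbols.split())  # whitespace between elements is optional
    if datatype == "STANDARD":
        return declared                  # replaces the default list
    return default + declared            # appended to the end of the default list

# One cell is either a parenthesized vector or a single state symbol.
CELL = re.compile(r"\(([^()]*)\)|(\S)")

def parse_cells(row_data, symbols):
    """Split one matrix row into cells: a symbol or a {symbol: probability} dict."""
    cells = []
    for match in CELL.finditer(row_data):
        vector, symbol = match.groups()
        if vector is None:
            cells.append(symbol)
            continue
        values = [float(v) for v in vector.split(",")]
        if len(values) != len(symbols):
            raise ValueError("vector length does not match the symbols list")
        # The order of the symbols list defines the order of the values.
        cells.append(dict(zip(symbols, values)))
    return cells

dna = effective_symbols("DNA")
print(effective_symbols("DNA", "LP"))
print(parse_cells("A(0.10,0.60,0.30,0.00)", dna))
```

Representing a probabilistic cell as a dictionary keyed by symbol keeps the values tied to the symbols list order, so a reader of the parsed data does not need to consult the FORMAT command again.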
Authors
Arlin Stoltzfus & Justin Reese, borrowing liberally from the published format
standard of Maddison, Swofford & Maddison, 1997.