Computational and Analytical Molecular Evolution Lab (CAMEL) at CARB

NEXUS-related Projects

NOTE: our object-oriented NEXUS API in Perl is now called "Bio::NEXUS" (was "NEXPL"). Please get Bio::NEXUS from CPAN.

Phyloinformatics software: Bio::NEXUS (a NEXUS API), nexplot and nextool

We have released a Bio::NEXUS package comprising Bio::NEXUS, an object-oriented NEXUS application programming interface (API) in Perl, along with two demonstration applications: nexplot (which generated the images on this page) and nextool (a scriptable editor). If you have not seen a NEXUS file or a nexplot figure before, see the example below and try the nexplorer server, a joint project of our group and the Evolutionary Bioinformatics Lab of Weigang Qiu at Hunter College. The server, to which we recently opened access, provides a graphical interface to some of the methods in Bio::NEXUS (a subset of the editing and visualization capabilities of nexplot and nextool), along with access to thousands of pre-computed sequence family data sets.

[Figure: alignment slice visualized by nexplot]
NEXUS file (pruned from the original for display purposes):
#NEXUS
BEGIN TAXA;
  DIMENSIONS ntax=26;
  TAXLABELS  O_volvulus_AAB64227.1 O_volvulus_AAB64226.1 C_elegans_AAF39759.1 C_elegans_AAA83577.1 
    S_cerevisiae_CAA89634.1 C_albicans_AAC12872.1 S_pombe_CAB57444.1 N_crassa_AAA63780.1 M_musculus_AAA40121.1 
    C_capitata_AAA57249.1 D_virilis_CAA32060.1 D_erecta_AAF23595.1 D_orena_AAF23594.1 D_teissieri_AAF23599.1 
    D_yakuba_AAF23598.1 D_melanogaster_AAF50095.1 D_mauritiana_AAF23597.1 D_sechellia_AAF23596.1 
    D_simulans_CAA33720.1 Z_mays_AAB49913.1 O_sativa_AAC14464.1 O_sativa_AAC14465.1 A_thaliana_AAF99769.1 
    P_tremuloides_AAD01605.1 A_thaliana_BAB09468.1 A_thaliana_AAD29823.2;
END;
BEGIN CHARACTERS;
  DIMENSIONS ntax=26 nchar=30;
  FORMAT  datatype=protein gap=- missing=?;
  CHARLABELS 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 
    113 114 115 116 117 118 119 120;
  MATRIX
    M_musculus_AAA40121.1       QGTIHFEQKASGE--PVVLSGQITGLTE-G
    C_capitata_AAA57249.1       KGTVHFEQQDAKS--PVLVTGEVNGLAK-G
    N_crassa_AAA63780.1         KGTVIFEQESESA--PTTITYDISGNDPNA
    [ ... rows deleted here ... ]
    D_simulans_CAA33720.1       KGTVFFEQESSGT--PVKVSGEVCGLAK-G
    S_cerevisiae_CAA89634.1     SGVVKFEQASESE--PTTVSYEIAGNSPNA
    S_pombe_CAB57444.1          SGVVTFEQVDQNS--QVSVIVDLVGNDANA;
END;
BEGIN ASSUMPTIONS;
  WTSET MySoapWeights  (VECTOR) = 1 1 1 1 1 1 1 1 0.83 0.8 0.8 0.8 0.8 0.8 0.71 0.71 1 1 1 1 1 1 1 1 
    1 1 1 1 1 1;
END;
BEGIN TREES;
  TREE "Cu-Zn Superoxide Dismutase" = (((((O_volvulus_AAB64227.1:0.31741,O_volvulus_AAB64226.1:0.13498):
    0.20268[1],(C_elegans_AAF39759.1:0.14579,C_elegans_AAA83577.1:0.27311):0.2533[1]):0.12655[0.98],
    ((S_cerevisiae_CAA89634.1:0.28255,C_albicans_AAC12872.1:0.25631):0.08358[0.91],(S_pombe_CAB57444.1:
    0.3159,N_crassa_AAA63780.1:0.1635):0.11954[0.97]):0.17514[1]):0.08988[0.77],(M_musculus_AAA40121.1:
    0.49149,(C_capitata_AAA57249.1:0.18945,(D_virilis_CAA32060.1:0.11453,(((D_erecta_AAF23595.1:0.00661,
    D_orena_AAF23594.1:0.00769):0.00497[0.92],(D_teissieri_AAF23599.1:0.004,D_yakuba_AAF23598.1:0.01012):
    0.0073[0.87]):0.01271[0.88],(((D_melanogaster_AAF50095.1:0.00836,D_mauritiana_AAF23597.1:0.00552):
    0.00203[0.28],D_sechellia_AAF23596.1:0.01103):0.00398[0.7],D_simulans_CAA33720.1:0.00595):0.00739[0.75]):
    0.11795[1]):0.11754[1]):0.12932[1]):0.10326[1]):0.0712[0.9],(((((Z_mays_AAB49913.1:0.05142,
    O_sativa_AAC14464.1:0.09031):0.02799[0.98],O_sativa_AAC14465.1:0.06915):0.05245[0.99],
    (A_thaliana_AAF99769.1:0.17064,P_tremuloides_AAD01605.1:0.1075):0.08023[1]):0.08596[1],
    A_thaliana_BAB09468.1:0.46052):0.06401[0.75],A_thaliana_AAD29823.2:0.42442):0.14252[0.94]);
END;
Left, phylogeny of a subset of Cu-Zn Superoxide dismutases. Right, slice of the protein sequence alignment. Above right, alignment column numbers with histogram of reliability scores. Nextool was used to extract this subset of proteins ("OTUs") and columns ("characters") from a larger file. Various features of the plot produced automatically by nexplot are customizable. The PDF linked to the above image (click to view) has right- (rather than left-) justified OTU names, a wider tree, and more space between the lines.
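
The kind of subsetting described in the caption (keeping some OTUs and some characters from a larger matrix) can be sketched in a few lines of Python. This is only an illustration of the idea, not how nextool or Bio::NEXUS implement it; the function name slice_alignment and the in-memory dictionary layout are assumptions made for this example. The rows are copied from the MATRIX shown above, and the column labels follow its CHARLABELS command.

```python
# Hypothetical helper for illustration; not part of nextool or Bio::NEXUS.
def slice_alignment(matrix, char_labels, otus, columns):
    """Return the rows for the chosen OTUs, restricted to the chosen column labels."""
    index = {label: i for i, label in enumerate(char_labels)}
    positions = [index[label] for label in columns]
    return {otu: "".join(matrix[otu][p] for p in positions) for otu in otus}

# Three rows copied from the MATRIX above; labels 91..120 as in CHARLABELS.
matrix = {
    "M_musculus_AAA40121.1": "QGTIHFEQKASGE--PVVLSGQITGLTE-G",
    "C_capitata_AAA57249.1": "KGTVHFEQQDAKS--PVLVTGEVNGLAK-G",
    "N_crassa_AAA63780.1": "KGTVIFEQESESA--PTTITYDISGNDPNA",
}
char_labels = [str(n) for n in range(91, 121)]

# Keep two OTUs and the first five labeled columns.
subset = slice_alignment(
    matrix,
    char_labels,
    otus=["M_musculus_AAA40121.1", "N_crassa_AAA63780.1"],
    columns=["91", "92", "93", "94", "95"],
)
```

Selecting columns by label rather than by position keeps the original alignment numbering meaningful after slicing, which is why the sliced file above still carries labels 91 to 120.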

Nexplot and nextool can be used together to create customized, publication-quality views of character data in a phylogenetic context. Nexplot has a variety of settings, which are described in doc/nexplot.html (the perldocs). Importantly, the output of nexplot is PostScript, a vector format, so the graphic elements are resolution-independent and can be scaled without loss of quality. PostScript figures can be converted into other formats such as JPEG or GIF if necessary. In summary, the advantages of nexplot are customizable settings, publication-quality vector output, and easy conversion to other graphics formats.

Installation

To install the library and the tools, download the Bio::NEXUS package from CPAN and follow the instructions in the README file (for a custom installation, read doc/Installation.pod). Alternatively, if your system has the "CPAN" module installed, the entire package can be downloaded, built, tested and installed by issuing a single command:
perl -MCPAN -e 'install Bio::NEXUS'

Features tested, implemented-but-untested, and planned (out of date)

Status    Nexplot    Nextool & Bio::NEXUS (API)    nexplorer (server)
Tested
  • PostScript conformance allows easy conversion
  • flexibility in some aspects of fonts and sizing
  • tree only, matrix only, or both
  • taxonomic coloring (SPANBLOCK)
  • select/exclude OTU set
  • select/exclude char set
  • alignment conversion to NEXUS
  • add Newick tree to NEXUS file
  • select OTUs or Chars
  • plot using nexplot
Untested
  • various layout & scaling features
  • re-root tree
  • re-name OTUs
Planned
  • residue coloring (e.g., Taylor's)
  • compute and display summary/consensus
  • more robust NEXUS reader
  • sub-select on internal node (subtree)
  • detect and report format errors
  • revision of SPAN block
  • more input formats for alignment conversion
  • replace names (nextool functionality exists)
  • compact and effective user interface
  • taxonomy server to allow taxonomic coloring

Why NEXUS?

Genome analysis is increasingly dependent on comparative methods of analysis. Even at the earliest stages of genome annotation, bits of a new genome sequence are searched against known sequences to mine clues useful in "functional" inferences (where does the gene start? where are the introns? what does it do?). Often these inferences are based on BLAST search results.

Though it is not widely appreciated, there already exists a sophisticated methodological framework for comparative analysis, developed over the past 40 years by systematists and evolutionary biologists, in which differences are interpreted according to probabilistic models of evolutionary divergence on a branching tree. The basic methods and concepts of comparative evolutionary biology, originally developed for morphological characters, can be applied directly to any kind of character (discrete or continuous, so long as it fits the character state data model).

In the ongoing quest to improve the accuracy and reliability of functional inferences, it is inevitable that the bioinformatics/genomics community will come to rely on these more sophisticated methods. This transition will require automatable tools for phylogenetic analysis and character reconstruction (which already exist to a large degree), portable and flexible formats for data exchange, infrastructure to facilitate integration, and better education about how to integrate probabilistic evolutionary reasoning into genome interpretation.

The NEXUS file format of Maddison, Swofford & Maddison, 1997 (Systematic Biology 46:590-621) was developed to facilitate the communication and storage of data for comparative analysis. We see it as a first step in developing a widely useful standard. We use a slightly modified version of NEXUS called "SPANDEX" as the exchange format for our own System for Phyloinformatic ANalysis (SPAN).

The NEXUS file format for comparative data

The NEXUS format conveys data organized according to the character state data model, in which the features of operational taxonomic units (OTUs) (e.g., species, individuals, genes, genomes, etc.) are observable states of underlying homologous characters. For instance, in a protein sequence alignment, proteins are the OTUs, alignment columns are characters, and amino acids (or gaps) are states. In evolutionary analysis, differences are typically treated as the result of state transitions that take place on the branches of a tree; the NEXUS file therefore provides a means to represent a tree, in the standard Newick (a.k.a. New Hampshire) format.
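
As a rough illustration of this data model, the Python sketch below stores a matrix as OTU name mapped to a string of states, reads off the states of one character, and checks that the leaves of a Newick tree are the same OTUs. The toy data and the function names (states_at, newick_leaf_labels) are invented for this example. The leaf extraction assumes unquoted labels and no commas or parentheses inside bracketed annotations, so it is not a general Newick parser.

```python
import re

# Toy data, invented for illustration: OTU name -> one state per character.
matrix = {"fish": "ACG", "dog": "ACT", "yeast": "A-T"}
tree = "((fish:0.1,dog:0.2):0.3,yeast:0.5);"

def states_at(matrix, column):
    """States observed for one character (0-based column), keyed by OTU."""
    return {otu: row[column] for otu, row in matrix.items()}

def newick_leaf_labels(newick):
    """Leaf labels of a Newick string; assumes simple, unquoted labels."""
    # A leaf label directly follows "(" or ","; internal nodes follow ")".
    return re.findall(r"[(,]\s*([^\s():,;\[\]]+)", newick)

column_states = states_at(matrix, 2)   # states of the third character
leaves = newick_leaf_labels(tree)      # OTUs at the tips of the tree
same_otus = set(leaves) == set(matrix) # tree and matrix describe the same OTUs
```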

The syntactic structure of a NEXUS file is as follows:

#NEXUS
begin < blockname >;
    < command > < argument > [additional argument];
    [ < another command with args >; ]
end;
[ < another block with commands > ]
Each of the pre-defined types of public blocks may appear only once. The TAXA block is the only necessary block. There are some restrictions on the ordering of blocks, and on the ordering of commands within a block. Application-specific "private" blocks are also possible. NEXUS keywords are not case-sensitive. We put names of BLOCKS in upper case here for mnemonic purposes.
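
The block-and-command structure above is regular enough that a small reader can be sketched directly from it. The following Python is a minimal sketch under stated assumptions, not the Bio::NEXUS reader: it treats every semicolon as a command terminator (so quoted labels containing semicolons would break it), and it discards all bracketed text as comments, which would also drop bracketed annotations such as the values attached to tree nodes in the example file. The function names are hypothetical.

```python
def strip_comments(text):
    """Remove [bracketed] comments, allowing nesting. Quoting is ignored."""
    out, depth = [], 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)

def parse_blocks(text):
    """Split NEXUS text into a list of (BLOCKNAME, [command, ...]) pairs."""
    text = strip_comments(text).lstrip()
    if not text.upper().startswith("#NEXUS"):
        raise ValueError("missing #NEXUS header")
    blocks, current = [], None
    for raw in text[len("#NEXUS"):].split(";"):
        command = " ".join(raw.split())  # collapse line breaks and indentation
        if not command:
            continue
        words = command.split()
        keyword = words[0].upper()       # keywords are not case-sensitive
        if keyword == "BEGIN":
            if current is not None or len(words) != 2:
                raise ValueError(f"unexpected command: {command!r}")
            current = (words[1].upper(), [])
        elif keyword == "END":
            if current is None:
                raise ValueError("END without a matching BEGIN")
            blocks.append(current)
            current = None
        elif current is None:
            raise ValueError(f"command outside a block: {command!r}")
        else:
            current[1].append(command)
    return blocks

example = """#NEXUS
BEGIN TAXA;
  DIMENSIONS ntax=3;
  TAXLABELS fish dog yeast;
END;
begin characters; [lower case is accepted]
  dimensions nchar=3;
  format datatype=dna gap=-;
  matrix
    fish  ACG
    dog   ACT
    yeast A-T;
end;
"""
blocks = parse_blocks(example)
```

Block names are upper-cased on the way in so that later lookups do not depend on how the file was typed.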

Some important blocks

Name         Description
TAXA         specifies OTUs in data set
CHARACTERS   specifies characters
SETS         assigns names to sets of characters or OTUs
ASSUMPTIONS  specifies assumptions for an analysis
CODONS       specifies codons and their genetic codes

Some important commands

Name             Block       Description
CharLabels       CHARACTERS  label for a character (column)
StateLabels      CHARACTERS  label for a state (the type of an instance of a character)
CharStateLabels  CHARACTERS  combined label for a character and its states
CharSet          SETS        give a name to some set of chars
TaxSet           SETS        give a name to some set of OTUs
GeneticCode      CODONS      specify a genetic code
CodeSet          CODONS      associate a code with a CharSet or TaxSet
Tree             TREES       specify a Newick tree

A set of NEXUS files for software testing

As part of the process of developing and testing our NEXUS library, we assembled a set of NEXUS files, available as a compressed archive (NEXUS_TestSet-0.6.tar.gz). Many of these are simple files that were generated to test specific features. Others are "real" data files from various sources. An HTML description of the files, nexus-testset.html, is included in the archive.

This set of files should be highly useful to anyone else developing software for NEXUS. However, please do not assume that it is exhaustive.

Extensions to NEXUS

The NEXUS file, as envisioned by Maddison, Swofford & Maddison, 1997, is quite flexible. For instance, it is possible to define an application-specific private block containing commands to be read by one application but not by others.

However, there are two modifications to the public blocks that are implemented in our current library:

(more material not yet fully incorporated into this document)

Links to some other tools for viewing trees

Ideas for a public DISPLAY block

The context of this is that we do some special coloring of OTUs, characters, and nodes/branches based on taxonomic assignments of OTUs. To carry this out, we invented a private SPAN block. This block associates OTUs with a taxonomic division. Nexplot interprets this to mean that the OTUs are to be colored according to a hard-coded scheme in SpanBlock.pm.

There are two problems with this. One is that it does not leverage the current NEXUS format by defining named sets in the SETS block. The second problem is that it makes the coloring a private matter known only to the SPAN block and the Perl module that reads it.

Let's start just by considering OTUs. Here is a generic method:

begin SETS; [NEXUS SETS block used to define sets] 
   otuset animals = fish dog cat mouse; 
   otuset plants = corn geranium; 
   otuset fungi = yeast; 
end; 

begin DISPLAY; 
   color animals red; 
   color plants green; 
   color fungi blue;  
end; 
The syntax for the color command would be:
color < otuset > [ scope ] < color_choice >; 

color_choice = < named_color > | (rgb) < rgb_vec > 
scope = [ all | names | data | tree ]  
where the otuset must be named in SETS; the color is either named or given as Red-Green-Blue values; and the scope defaults to "all", with "names" = color OTU display names, "tree" = propagate up from these OTUs by consensus, "data" = color data rows for these OTUs.
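
To make the proposed syntax concrete, here is a hedged Python sketch of how a reader might interpret a single color command. Since the DISPLAY block is only an idea, nothing here reflects existing software: the function name parse_color_command is hypothetical, and the form of rgb_vec (three whitespace-separated integers) is an assumption, because the proposal does not specify it.

```python
SCOPES = {"all", "names", "data", "tree"}

def parse_color_command(command, known_sets):
    """Parse 'color <otuset> [scope] <color_choice>;' into a dictionary.

    known_sets holds the set names defined in the SETS block.
    Assumption: rgb_vec is three whitespace-separated integers.
    """
    words = command.strip().rstrip(";").split()
    if len(words) < 3 or words[0].lower() != "color":
        raise ValueError(f"not a color command: {command!r}")
    otuset, rest = words[1], words[2:]
    if otuset not in known_sets:
        raise ValueError(f"set {otuset!r} is not defined in SETS")
    scope = "all"  # default scope when none is given
    if len(rest) > 1 and rest[0].lower() in SCOPES:
        scope, rest = rest[0].lower(), rest[1:]
    if rest[0].lower() == "(rgb)":
        if len(rest) != 4:
            raise ValueError("expected three values after (rgb)")
        color = tuple(int(value) for value in rest[1:])
    elif len(rest) == 1:
        color = rest[0].lower()  # a named color
    else:
        raise ValueError(f"cannot parse color choice: {' '.join(rest)!r}")
    return {"otuset": otuset, "scope": scope, "color": color}

sets = {"animals", "plants", "fungi"}
named = parse_color_command("color animals red;", sets)
scoped = parse_color_command("color plants names (rgb) 0 128 0;", sets)
```

Requiring the set name to be present in SETS at parse time enforces the rule that the otuset must be named there, so a misspelled set name fails early instead of silently coloring nothing.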