r5 - 11 Oct 2006 - 12:22:59 - ArlinStoltzfus

CDAT Design Teleconference, 10 Oct 2006

The Webex teleconference lasted from 2:00 to about 3:00 pm and included Weigang (came late, left early), Aaron, Rutger, and Arlin.

Perl OO intricacies

Before Weigang joined us, Arlin queried Aaron and Rutger about the design of Rutger's Root.pm and use of tie-ing. This led on to other things and occupied our discussion for about 15 minutes until we got back to our goals in carrying forward the design process.

"Syntactic sugar" refers to making things easy for the applications programmer, e.g., tricks to make an object that is not an array respond rationally to shift, push, pop, etc. Bio::Phylo has some of these tricks.

Discussion of the Listable class in Bio::Phylo.
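For illustration only, here is a minimal sketch of this kind of syntactic sugar. It is written in Python rather than Perl, it is not Bio::Phylo's Listable code, and the `TaxonSet` class is hypothetical. The object is not a built-in list, but it responds to the usual list operations.

```python
from collections.abc import MutableSequence


class TaxonSet(MutableSequence):
    """Hypothetical container that is not a list but behaves like one."""

    def __init__(self, names=()):
        self._names = []
        for name in names:
            self.append(name)

    # The five methods below are all that MutableSequence requires;
    # append, pop, extend, etc. are then supplied automatically.
    def __getitem__(self, index):
        return self._names[index]

    def __setitem__(self, index, name):
        self._names[index] = name

    def __delitem__(self, index):
        del self._names[index]

    def __len__(self):
        return len(self._names)

    def insert(self, index, name):
        self._names.insert(index, name)


taxa = TaxonSet(["human", "chimp"])
taxa.append("gorilla")   # analogous to Perl's push
last = taxa.pop()        # analogous to Perl's pop
first = taxa.pop(0)      # analogous to Perl's shift
```

In Perl the comparable effect is obtained with tie or operator overloading; the sketch only shows what the applications programmer sees.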

Division of labor: test writers vs. developers

At this point there was some discussion about "next steps" that led to a clarification and a division of labor (see action items). Aaron expressed the idea that we were not all at the same level in terms of proficiency in Perl and familiarity with design patterns. Arlin agreed that he lacked the experience to evaluate designs abstractly, and that he could understand things better in the context of specific applications code. Aaron suggested that Weigang and Arlin be tasked with writing the test suite, while Aaron and Rutger would develop an implementation to satisfy the tests (perhaps with modification).

At about this point Weigang muted his phone to take another call.

After this, we wrapped up our EIG business (see action item about emails) and Rutger signed off. Aaron and Arlin stayed online to talk about other business. Weigang re-joined us after 20 minutes and we related what had happened in his absence.

Action items

  • all to tag their EIG-related email messages as per email from Arlin (EIG_internal in subject line)
  • Weigang and Arlin to begin work on test suite
  • Aaron and Rutger to continue discussion on design patterns via phone or email
  • at some point we need to set a date for the next telecon

CDAT Design Teleconference, 22 Sept 2006

The Webex teleconference included Weigang, Aaron, Rutger, and Arlin.

Common source code repository

  • We all agree to use Subversion for source code control, and to use the CIPRES repository that Rutger set up for us.
  • Each of us has tested this for access.
  • Weigang is still having problems. He will work these out with the sys admin.

Weigang's CDAT code

  • matrix, character data, etc.
  • script with steps
    • input alignment (AlignIO?)
    • convert to CDAT
    • read pre-computed tree with TreeIO?
    • add to CDAT
    • minor operations after this
  • note that this enforces strict matching of OTU names between the data and the trees
  • also enforces uniqueness of names
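The two constraints noted above (matching OTU names and unique names) could be checked along the following lines. This is a sketch in Python, not Weigang's code; the function name `check_otu_names` and the use of plain lists of names are assumptions made for illustration.

```python
def check_otu_names(matrix_names, tree_names):
    """Raise ValueError unless both lists are duplicate-free and name the same OTUs."""
    for label, names in (("matrix", matrix_names), ("tree", tree_names)):
        duplicates = sorted({n for n in names if names.count(n) > 1})
        if duplicates:
            raise ValueError(f"duplicate OTU names in {label}: {duplicates}")
    # Symmetric difference: names present on one side only.
    mismatched = sorted(set(matrix_names) ^ set(tree_names))
    if mismatched:
        raise ValueError(f"OTU names not shared by matrix and tree: {mismatched}")
    return True


ok = check_otu_names(["human", "chimp"], ["chimp", "human"])
```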

Further discussion

  • we agree on a loose goal for people to come up with CDAT demos before next time
  • Arlin requests input on how we should approach the NESCent initiative, given that we are all on the planning committee and that the goals overlap. Should we share resources? Should we devote our telecons to the hack-a-thon for now?
  • Aaron and Rutger both think that we should keep things separate

Action items

  • work on CDAT demos; check in work to svn

CDAT Design Teleconference, 30 Aug 2006

The Webex teleconference lasted from 1:30 to 3:30 pm and included Weigang, Aaron, Rutger, and Arlin.

Introduction to the CoreEIG twiki web

Arlin first introduced the re-organized Twiki site. Most feel we need to learn more to make efficient use of the Twiki tool.

Discussion of use cases

Most of the time was spent on discussion of the use cases.
  • Arlin added a section on population genetics applications, based on some existing pop gen software. He raised the problem of how to make the CDAT model flexible enough to deal with haplotype and diploid genotype data.
  • Weigang suggested including "phylogenetic footprinting" as another use case (which he is free to add to the Twiki page). He also suggested that we start uploading some input files for each use case.
  • Aaron introduced two industrial paradigms of managing a software project: "extreme programming" & "refactoring". One is to start coding for a use case right away without worrying too much about the overall object design ("bottom-up"). This may create a lot of downstream problems, such as backward compatibility between versions. The other is the "top-down" approach of having a coherent design first. However, it assumes that all the possible target problems are known.
  • A consensus was reached for the following mixed approach: (i) select a set of core use cases as the template for the initial CDAT design; (ii) write user-level scripts for each case, assuming the CDAT object exists (a task assigned to Weigang for an initial attempt); (iii) start implementing the CDAT object.

Logistics

  • Considering September is a busy month, we will meet for the webex teleconference every two weeks.
  • Arlin and Rutger thought it necessary to keep interested parties (Philly group, NESCent developers) informed about our efforts & progress. The reasons are, at least, (i) we need community feedback, especially from BioPerl programmers; (ii) possible use of resources from NESCent and other granting agencies (e.g., a future CDAT hackathon). Arlin will write to Todd Vision of NESCent to update him on our progress.

CDAT Design Teleconference, 22 Aug 2006

These notes were drafted by Weigang and extended by Arlin.

The Webex teleconference lasted from 2:00 to 4:30 pm and included Weigang, Aaron, Rutger, John Bradley, and Arlin.

Target Problems/Use Cases (Twiki presentation by Arlin)

There is a general feeling that these 7 use cases compiled by Arlin cover most of the molecular evolution applications we intend the CDAT model to solve.

Discussion:

  • Weigang suggests we include a population genetics use case; Arlin agrees that, even if we decide ultimately not to support pop gen analysis, we should still include the use case for the sake of completeness.
  • Aaron asks for more detail on the parts most relevant to CDAT design considerations
  • Arlin suggests the value of adding sample input files for use cases

Introduction to Bio::Phylo (Slide presentation by Rutger)

Some points about Bio::Phylo design and capacities:
  • tree/matrix parsing and tree drawing
  • interfaces with BioPerl through Bio::Tree::TreeI and Bio::Tree::NodeI. For example, BioPerl tree objects can be imported by using the "new_from_bioperl" method
  • inside-out objects, i.e., fully encapsulated: the object is only a key, and cannot be dumped using Data::Dumper
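The inside-out idea can be sketched as follows. This is a Python analogy, not Bio::Phylo's Perl implementation, and the `Node` class and its methods are hypothetical: attribute values live in class-level storage keyed by the object's identity, so the object itself carries no data to dump.

```python
class Node:
    """Hypothetical inside-out style object: the instance is only a key."""

    _names = {}  # class-level storage, keyed by object identity

    def __init__(self, name):
        Node._names[id(self)] = name

    def get_name(self):
        return Node._names[id(self)]

    def set_name(self, name):
        Node._names[id(self)] = name

    def __del__(self):
        # Clean up the stored value when the object goes away.
        Node._names.pop(id(self), None)


node = Node("Homo sapiens")
instance_state = vars(node)   # the instance dictionary holds nothing
name = node.get_name()
```

The trade-off is the one noted in the list: full encapsulation, at the price that generic dumping tools see an empty object.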

Current and future directions

  • Bio::Phylo is now part of CIPRES project.
  • will interface with Bio::CDAT through Bio::CDAT::IO.

Discussion

There was an extensive discussion on the problem of maintaining data integrity through multiple data modifications (e.g., OTU exclusion, character exclusion). Is the CDAT going to be just a bag of things with very few constraints, or is it going to be highly structured to enforce integrity?

Arlin suggested a simple use case to bear in mind when thinking about this: suppose we want to write an interactive graphical ClustalW wrapper, where the user goes through iterations of align, infer tree, view results, prune data (by rows=seqs or columns=positions). Is each CDAT a stand-alone object fully instantiated with its seq data? Or is it just a view that imposes a filter on the original unaltered complete data set? If it is just a view, is it implemented as a set of relations specifying which characters of the original data are in which cells of the derived matrix, or is it a set of edit operations to get the derived data set from the original?

A key issue, Aaron notes, is whether we want to have an "undo" function, to go back to a previous data set.
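To make the "view" option concrete, here is a minimal Python sketch under stated assumptions: the `MatrixView` class, its method names, and the dictionary-of-strings data layout are all hypothetical and were not proposed at the meeting. The view stores only which rows and columns of the unaltered data are visible, so an earlier state can be recovered simply by keeping the earlier view.

```python
class MatrixView:
    """Hypothetical filtered view over an unaltered character matrix."""

    def __init__(self, data, rows=None, columns=None):
        self._data = data  # {otu_name: sequence string}; never modified here
        self.rows = list(data) if rows is None else list(rows)
        n_cols = len(next(iter(data.values())))  # assumes non-empty, equal-length rows
        self.columns = list(range(n_cols)) if columns is None else list(columns)

    def prune(self, drop_rows=(), drop_columns=()):
        """Return a new view; this view and the original data are untouched."""
        rows = [r for r in self.rows if r not in drop_rows]
        columns = [c for c in self.columns if c not in drop_columns]
        return MatrixView(self._data, rows, columns)

    def materialize(self):
        """Build the derived matrix that the view describes."""
        return {r: "".join(self._data[r][c] for c in self.columns)
                for r in self.rows}


original = {"A": "ACGT", "B": "AC-T", "C": "TCGT"}
full = MatrixView(original)
pruned = full.prune(drop_rows={"B"}, drop_columns={2})
derived = pruned.materialize()   # {"A": "ACT", "C": "TCT"}
restored = full.materialize()    # "undo": the earlier view is still valid
```

This corresponds to the "set of relations" alternative; the "set of edit operations" alternative would instead record each prune step and replay the steps against the original.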

Rutger proposed an attractive solution of separating the model from the view of the data using the MVC (model view controller) design pattern. We have the choice to develop CDAT as either an MVC or just an "M", just the model. The key question is how much of the view control is going to be model-specific (i.e., something we need to develop along with the model), as opposed to being generic.

CDAT Concept Glossary (Twiki presentation by Aaron)

Aaron presented a glossary of concepts and a short list of objectives. There was extensive discussion.

There is a consensus to use OTU names as primary keys for a CDAT object. A CDAT object can have multiple matrices and multiple trees associated with a set of OTUs.

The most debated concept is "Matrix". It seems that a 2D matrix is a limited and inconsistent representation of character data because, in a matrix, both the column order and the row order are meaningful. Instead, Arlin suggests, the core concept of a character "matrix" is really a character table in which the order of rows and columns is not important. For instance, morphology characters are unordered but columns in a molecular sequence alignment are ordered. A further abstraction of a matrix as a collection of character states may result in a more consistent and flexible representation.

There was some agreement that the base class of character data should be some kind of collection of rows and columns and not a "matrix". This is apparently how it is done in Bio::Phylo?

Following a suggestion from Aaron and Arlin, there was agreement that we need to flesh out some of these ideas with an example showing different ways to implement the character data "matrix" or table. Rutger volunteered to do this. Rutger mentioned that, in Wayne Maddison's experience implementing Mesquite, a mistake was made: it would have been better to have ordered characters (sequences) be a derived class rather than a base class.
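One way to picture the arrangement discussed here is the following Python sketch. It is not Rutger's example, not Bio::Phylo code, and not Mesquite's design; the class and method names are hypothetical. The base class is an unordered table of cells keyed by OTU and character, and the ordered alignment is derived from it.

```python
class CharacterTable:
    """Base class: cells keyed by (OTU name, character); no order is implied."""

    def __init__(self):
        self._cells = {}

    def set_state(self, otu, character, state):
        self._cells[(otu, character)] = state

    def get_state(self, otu, character):
        return self._cells[(otu, character)]

    def otus(self):
        return {otu for otu, _ in self._cells}

    def characters(self):
        return {character for _, character in self._cells}


class Alignment(CharacterTable):
    """Derived class: characters are integer positions, so order is meaningful."""

    def row(self, otu):
        positions = sorted(ch for o, ch in self._cells if o == otu)
        return "".join(self._cells[(otu, p)] for p in positions)


morph = CharacterTable()
morph.set_state("taxon1", "petal_color", "red")   # unordered morphological character

aln = Alignment()
for position, base in enumerate("ACG"):
    aln.set_state("taxon1", position, base)
sequence = aln.row("taxon1")
```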

EIG start-up meeting, 29 June 2006, Philadelphia, PA

The complete information package (warning: 12 MB) from our 29 June Philly meeting has background papers and project docs.

Below is a quick twiki translation from the previously distributed version of the meeting notes, EIG_1_notes.doc.

Preliminaries:

Supporting evolutionary analysis via BioPerl

Evolutionary Informatics Group meeting, 29 June 2006, Blockley Hall, University of Pennsylvania, Philadelphia, PA

In attendance: Barry Dancis, Jim Brown, Vivek Gopalan, Tom Hladish, Aaron Mackey, William McCaig, Lucia Peixoto, Weigang Qiu, Arlin Stoltzfus

This document was written by Arlin based partly on notes from Lucia supplemented by improvisation to fill in the gaps. The meeting loosely followed an agenda and made extensive reference to a binder of supporting material. References to the documents in the binder are shaded like this.

Most of us arrived at Gia Pronto before 10:00. We telephoned Weigang and William to find that, due to flooding-related delays, their train had just left NYC. We proceeded to Blockley Hall at about 10:00 where we met Barry, set up the room and equipment, and began the meeting at 10:30. Weigang and William arrived at about 11:20.

Introductory remarks

What we hope to accomplish

Arlin began with some unscheduled remarks about what we hope to accomplish

  • Central role of comparative analysis (vs. a priori approaches)
  • Problem: powerful applications (e.g., PHYLIP) but weak informatics. If applications A, B, C have standardized input and output, and a programmable interface, then with proper glue code we can chain them together into an automated pipeline. Currently not possible (e.g., interfaces like PAML suck)
  • Approach: use cases → design principles → implementation → evaluation (for an example, see the paper by Nakhleh, et al.)
  • Strategy: leverage existing BioPerl framework (see Bio::CDAT proposal)
  • Targets for support (see NESCent proposal):
    • standard exchange format for data model (NEXUS or successor)
    • import from foreign file formats into data model
    • relational schema for data model
    • control (of analysis software)
    • editing
    • visualization

Aaron objected that visualization is not part of informatics and that we should not address it. Arlin agreed that visualization is not part of “informatics” but argued it was part of providing software support for the data model and for evolutionary analysis, and that some of us remain interested in visualization tools regardless of definitional categories.

Use-cases in comparative analysis

Use cases

Predetermined list of cases

Arlin presented several different cases. Jim and Weigang spoke briefly on specific cases, and Lucia spoke at more length on genome-scale phylogenies. This session went for a long time, until past noon. Sometimes the discussion diverged into defining software problems and considering possible solutions.

“Functional” inference. Eisen has pointed out the importance of distinguishing relationships of orthology from paralogy in order to improve classification of genes into “functional” categories. Eisen and Wu point out that “function” can be treated as a character state. Thus “functional inference” is formally a problem of inferring a missing character state using the rest of the (non-missing) data.

Lucia objected that sometimes there is non-orthologous gene replacement. Arlin did not see this as a reason to abandon phylogenies and evolutionary models. He suggested that the evolutionary approach (ideally) embodies three principles:

  1. Data not independent, but related by a tree
  2. Dynamics of change along branches (edges) reflect evolutionary genetics
  3. Applying this framework yields probabilistic inferences

Some methods are not evolutionary, e.g., “entropy” comes from information theory and is usually applied as a measure of “conservation” in a way that ignores the tree; likewise, an HMM alignment treats each sequence as an independently generated time-series.
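As a concrete instance of such a tree-ignoring measure, the Shannon entropy of one alignment column can be computed as in the sketch below (Python, with a hypothetical function name). Each sequence contributes one independent observation, and nothing about the tree enters the calculation.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the states observed in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # H = sum over states of p * log2(1 / p), with p the observed frequency.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


conserved = column_entropy("AAAA")   # one state only: zero entropy
variable = column_entropy("AACC")    # two equally frequent states: one bit
```

Lower entropy is read as higher "conservation", regardless of how the sequences are related.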

Lucia pointed out that the tree may be a network, not technically a tree. Arlin noted that he did not intend to exclude this possibility and that we should keep it in mind.

Lucia objected that whole genomes sometimes are compared and there is no general model for this. Arlin did not agree that this changes things in any way.

Maybe I (Arlin) should point out that my definition refers to broad methodological issues, not to details. The alternative to #1 is to treat data as independent samples, e.g., as in my example with “entropy”. But even if we choose to use a tree, we do not have to interpret it, via principle #2, as the representation of a dynamic process governed by rules based on evolutionary genetics. Creationists or Hennigian cladists would not accept this interpretation (the latter see the tree as a kind of most-economical description of the data, not as an inferred path of descent). Principle #3 is that these dynamics are sufficiently stochastic that we must always consider uncertainty.

“Detecting positive selection”, as with the PAML software. The input is a sequence alignment and a tree; the output is a column-wise value, the replacement/synonymous ratio.

Morphological data may be coded as discrete characters (even if the observations are on a continuous scale, as in Schols, et al) and analyzed using evolutionary methods.

Analysis of introns (or, in principle, any other gene features). Qiu, et al. use the phylogeny for the gene family as the basis for analyzing intron loss and gain. This illustrates binary character states, NEXUS files, and integration of different types of data.

Kinases. Jim joined in to describe this (see unmarked figure after Qiu, et al. in your package of documents). For purposes of analyzing drug resistance, we want to visualize inhibitor data (negative log IC50 values are shown on this figure) relative to the tree, and we want to analyze it relative to the protein sequence and structure.

Arlin pointed out that this figure, also the figures from Qiu, et al., were made directly from a NEXUS file using the nexplot software originally written by Weigang.

Population analysis. Weigang described some of his work with Borrelia, the bacterial agent of Lyme disease, using another figure from the document packet. Phylogenies are used to organize and interpret sequence data from different loci. The patterns indicate recombination.

There was some discussion about thresholds that are used in various cases. In this case, isolates are assigned to the “same” haplotype for a locus if they are within 90 % (?) sequence similarity.

Genome-level phylogeny. Lucia described some of her work on genome-level phylogeny. Different types of methods are used (handout). Lucia developed a pipeline. Tree manipulations were done in Bio::Phylo. Trees are compared to identify events of lateral gene transfer. Only complete genomes were used to avoid sampling artifacts.

Challenges for integration and interoperability

Envisioning large-scale integrative projects

The agenda called for us to envision other use-cases and to imagine large integrative projects. We got distracted from this, but there was some discussion.

As a conversation-starter, Arlin suggested a project based on research interests of a colleague (a structural biologist). The idea is to find Giardia proteins that could be used in a structure-based drug design strategy: the targets have to be essential (compare to yeast essential genes or worm RNAi lethal genes), sufficiently dissimilar to human homologs to avoid cross-reactions (use sequence similarity or orthology analysis to address this), and they must be soluble (not membrane) proteins for structure analysis. This would be done on the whole genome, and the result would be a web service to access the results and create custom rankings of targets. This could be a demonstration project to automate a practical and integrative kind of bioinformatics analysis.

Jim objected that this was not a very good strategy.

There was some discussion about what would constitute good frameworks for testing the automated informatics approach (demonstration project):

  • importing and manipulating of MSA
  • importing and manipulating of trees
  • adding accessory data (structure, function, and essentiality)
  • calculating evolutionary parameters (conservation)

It was about at this point that we took a break for lunch, at about 1:00.

Available technology and standards

The agenda called for a presentation of some different areas of technology and standards that would be relevant to providing software support for evolutionary analysis.

CDAT

Arlin began this by pointing out that underlying evolutionary methods is an implicit data model that may be called the “character-state data model”. Aaron and Arlin call it “CDAT” for “character data and trees”. This data model is exemplified in the MacClade interface, and by the examples we discussed earlier (molecular or morphological characters; continuous or discrete characters). It goes hand-in-hand with a methodology of collect observations → homologize (assign states to characters for each “OTU”) → analyze phylogenetically.

NEXUS (Maddison, et al., 1997)

NEXUS is the de facto standard exchange format for the CDAT model. It has a structure consisting of blocks and commands. Programs may designate “private” blocks to define a control interface. The commands of public blocks are pre-defined. Few implementations of the NEXUS API are complete. Some problems:
  • no references (only literal names for everything)
  • no way to store ancestral states
  • more generally, no generic method to annotate nodes and branches
  • Newick string

Trees and networks

Tom presented some brief examples of funky tree strings from applications that infer rootings and duplications. Tom gave each of you some printings to be clipped into the last “ADDITIONS” section of your notebooks.

Arlin pointed out that these applications require non-standard input and output formats, but the only purpose of the non-standard parts is to annotate nodes and branches. A more generic format such as PhyloXML, or any data model in which the nodes and edges are objects, will fulfill the same purpose.

At this point there was some discussion about trees. It was pointed out that a list of nodes and edges can be used to specify any graph, including graphs that are trees, whereas the Newick string is suited only to trees.
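The point about node and edge lists can be illustrated with a short Python sketch; the function name and the example labels are hypothetical. The same representation holds both a tree and a network, and a simple check distinguishes them.

```python
def is_rooted_tree(nodes, edges):
    """True if the directed (parent, child) edges form a single rooted tree.

    Assumes every node mentioned in `edges` also appears in `nodes`.
    """
    parents = {}
    children = {n: [] for n in nodes}
    for parent, child in edges:
        if child in parents:   # a second parent means a reticulation
            return False
        parents[child] = parent
        children[parent].append(child)
    roots = [n for n in nodes if n not in parents]
    if len(roots) != 1:
        return False
    # Every node must be reachable from the single root.
    seen, stack = set(), [roots[0]]
    while stack:
        node = stack.pop()
        seen.add(node)
        stack.extend(children[node])
    return len(seen) == len(nodes)


nodes = ["root", "n1", "A", "B", "C"]
tree_edges = [("root", "n1"), ("root", "C"), ("n1", "A"), ("n1", "B")]
network_edges = tree_edges + [("root", "A")]   # A now has two parents

tree_ok = is_rooted_tree(nodes, tree_edges)          # expressible as ((A,B),C) in Newick
network_ok = is_rooted_tree(nodes, network_edges)    # a graph, but not a tree
```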

Edit history

This was also interspersed with some discussion about what features we want to build into a CDAT interface. Aaron presented the idea of having an edit history. For purposes of clarity, I (Arlin) am going to put this material together at the end of the document.

Relational back-end storage

Aaron presented the issue of relational storage, referring to the hypothetical TreeBase II schema and the paper by Nakhleh, et al., 2003. Arlin suggested that the schema could not be correct because some of the arrows were wrong. Aaron replied that this was a hypothetical schema and not the real thing. Aaron gave some examples of how the schema works. This seems to be a very “heavyweight” schema with individual tables for rows, columns and cell values. Trees are represented in graph form as a set of nodes and edges.
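To give a feel for what "individual tables for rows, columns and cell values" means in practice, here is a small sketch using Python's built-in sqlite3 module. The table and column names are invented for illustration; this is not the TreeBase II schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE matrix_row    (row_id INTEGER PRIMARY KEY, otu_name TEXT UNIQUE NOT NULL);
CREATE TABLE matrix_column (column_id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE cell (
    row_id    INTEGER REFERENCES matrix_row(row_id),
    column_id INTEGER REFERENCES matrix_column(column_id),
    state     TEXT NOT NULL,
    PRIMARY KEY (row_id, column_id)
);
CREATE TABLE tree_node (node_id INTEGER PRIMARY KEY,
                        row_id INTEGER REFERENCES matrix_row(row_id));
CREATE TABLE tree_edge (parent_id INTEGER REFERENCES tree_node(node_id),
                        child_id  INTEGER REFERENCES tree_node(node_id),
                        length REAL);
""")

conn.executemany("INSERT INTO matrix_row VALUES (?, ?)", [(1, "human"), (2, "chimp")])
conn.executemany("INSERT INTO matrix_column VALUES (?, ?)", [(1, "site1"), (2, "site2")])
conn.executemany("INSERT INTO cell VALUES (?, ?, ?)",
                 [(1, 1, "A"), (1, 2, "C"), (2, 1, "A"), (2, 2, "T")])
# Node 1 is the internal root (no OTU); nodes 2 and 3 are tips linked to matrix rows.
conn.executemany("INSERT INTO tree_node VALUES (?, ?)", [(1, None), (2, 1), (3, 2)])
conn.executemany("INSERT INTO tree_edge VALUES (?, ?, ?)", [(1, 2, 0.1), (1, 3, 0.2)])

# Reading back even one matrix requires joining three tables.
states = conn.execute("""
    SELECT r.otu_name, c.label, x.state
    FROM cell AS x
    JOIN matrix_row AS r    ON r.row_id = x.row_id
    JOIN matrix_column AS c ON c.column_id = x.column_id
    ORDER BY r.row_id, c.column_id
""").fetchall()
```

The need for joins on every read is one sense in which such a schema is "heavyweight"; the benefit is that rows, columns, and cells can each carry their own annotations.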

Visualization

William presented some of his work to visualize alignments. The layout is based on the Bio::Graphics concept of panel with tracks with glyphs. Columns, rows, or cells can be colored.

Aaron suggested that if the alignment part itself could be a glyph, then this could be integrated into the rest of Bio::Graphics. Arlin pointed out that all this needs is to have a glyph interface to the MSA visualizer, then it’s a glyph.

Goals, strategies and the big picture

We did not come up with a definitive list of design principles. Here I will try to summarize miscellaneous issues that came up in the course of the day, relating to the big picture, and to general goals and strategies.

Awareness of prior work and related efforts

  • NEXUS is a full-featured format for the character-state data model
  • BioPerl has Tree, Align, Matrix and Run modules, with uneven support
  • Rutger’s Bio::Phylo addresses many of our main concerns
  • CIPRES has similar goals to ours
    • Narrower because it is focused mainly on TOL-related issues
    • Broader because of interest in C, C++, and Java support
  • CIPRES database team has developed a heavyweight schema for TreeBase

Possible design principles and open questions

  • interoperable with NEXUS and Newick but not limited to them (i.e., back-compatible)
  • interface to relational back end
  • integrates with Bioperl, Bio::Phylo interfaces
  • interface to CIPRES server (CORBA)
  • provides utility methods, e.g., select nodes by properties (subset manipulation)
  • allows for integration of diverse types of data (variants; protein structure; expression; modifications; features; interaction data and other genetic data)
  • defines relationships between data types (e.g., DNA → protein, translate on-the-fly, DNA → sequence features)
  • defines useful rules that will govern software implementation (e.g., things we know biologically)
  • Aware of logical dependencies among elements, i.e., inferences such as a tree depend on a particular part of the data
  • include a history mechanism? (does Mesquite support this? if so, let’s study it)
    • Derive each version from previous version
    • Use private block?
    • Allow purge
  • Support graphs that are not trees? (what does this entail? How much support?)

Strategies for implementing, testing and promoting useful software

  • Analyze use-cases in order to derive and prioritize design principles
  • Come up with a list of queries or operations to support, as in Nakhleh, et al.
  • Be aware of existing technologies when devising an implementation strategy
  • Have a test or contest to compare different implementations of a problem

-- ArlinStoltzfus - 28 Aug 2006

Topic attachments

  • EIG_1_notes.doc (55.0 K, 22 Nov 2006, ArlinStoltzfus)
  • Bio-Phylo-arch.pdf (1161.8 K, 22 Nov 2006, RutgerVos): Bio::Phylo doxygen API documentation
  • Bio-Phylo-docs.pdf (2219.9 K, 22 Nov 2006, RutgerVos): Design discussion and manual
  • Bio-Phylo-slides.pdf (28.6 K, 22 Nov 2006, RutgerVos): Bio::Phylo powerpoint slides 22/8/06