- Aaron's Bio::CDAT implementation considerations
- Email discussions
- Subject: character state matrix api
- Date: Wed, 12 Jul 2006 15:01:55 -0700, Rutger Vos <rvosa@sfu.ca>
- Date: Thu, 13 Jul 2006 08:33:08 -0400, aaron.j.mackey@gsk.com
- Date: Thu, 13 Jul 2006 13:54:14 -0700, From: Rutger Vos <rvosa@sfu.ca>
- Date: Sat, 15 Jul 2006 13:19:35 -0400, From: aaron.j.mackey@gsk.com
- Date: Mon, 17 Jul 2006 18:06:16 -0700, From: Rutger Vos <rvosa@sfu.ca>
- Date: Tue, 18 Jul 2006 09:42:51 -0400, From: aaron.j.mackey@GSK.com
- Date: Tue, 18 Jul 2006 13:19:14 -0700, From: Rutger Vos <rvosa@sfu.ca>
- Date: Wed, 19 Jul 2006 08:57:06 -0400, From: aaron.j.mackey@gsk.com
- Date: Wed, 19 Jul 2006 18:17:56 -0700, From: Rutger Vos <rvosa@sfu.ca>
- Subject: Bio::Align::AlignI not suited, right?
- Subject: Character flyweight sketch
- Date: Mon, 24 Jul 2006 16:27:47 -0700, From: Rutger Vos <rvosa@sfu.ca>
- Date: Tue, 25 Jul 2006 09:36:18 -0400, From: aaron.j.mackey@gsk.com
- Date: Wed, 26 Jul 2006 12:39:00 -0400, From: arlin.stoltzfus@nist.gov
- Date: Wed, 26 Jul 2006 12:52:54 -0400, From: aaron.j.mackey@gsk.com
- Date: Wed, 26 Jul 2006 09:59:09 -0700, From: Rutger Vos <rvosa@sfu.ca>
- cdat prototyping
- CDAT design consideration: Mediator pattern
- next topic
- Post mortem
Aaron's Bio::CDAT implementation considerations
(not a complete list! topics for discussion!)
1) composited objects should be BioPerl-ready, if not BioPerl-native (though, as long as CDAT can itself consume and produce BioPerl-ready objects, this stricture need not apply)
2) granularity choice of the (potentially probabilistic) CharStateMatrix object is a concern for memory use and parsing/object-construction speed (lazy evaluation? binary encoding? Flyweight pattern?)
3) cardinality of composited datatypes: one or many CharStateMatrices per CDAT? one or many Trees per CDAT (or per CharStateMatrix)? Arguments exist for both ... CDS/protein "duality" case, but any others? If cardinality > 1 in any dimension, does the API become too complicated? Rutger argues that a CDAT is the intermediate in a many-to-many relationship between Matrices and Trees ... another idea for consideration: CDATs may themselves be composites of related (sub-)CDATs ...
4) Bio::CDAT construction/IO: de novo, flatfile (Bio::NEXPL), relational (GSK::PhyloDB::CDAT); choices here affect answers to #2 above, and vice versa
5) should components implement Bio::AnnotableI? Bio::LocatableI? Bio::RangeI? Bio::LocatableSeq?
6) do CDAT objects manage mutation of underlying components? via an Observer pattern?
Email discussions
Subject: character state matrix api
Date: Wed, 12 Jul 2006 15:01:55 -0700, Rutger Vos <rvosa@sfu.ca>
Date: Wed, 12 Jul 2006 15:01:55 -0700
From: Rutger Vos <rvosa@sfu.ca>
Subject: character state matrix api
Hi all,
the following is a sketch of an api for character state matrices. It
inherits from Bio::Matrix::MatrixI, adds data type functionalities and
nexus tokens, matrix operations as in Bio::Matrix::GenericMatrix and
implements a "Bio::CDAT::ContainedObjectI" interface (only methods:
get_cdat/set_cdat). All feedback welcome!
#####################################################################
New methods in Bio::Phylo::Matrices::CharMatrixI (future name:
Bio::CDAT::CharMatrixI or Bio::Matrix::CharMatrixI). These methods are
accessors - i.e. read only - that map onto the respective nexus tokens
by the same name. The idea is that this will be easy to remember, and
handy for insertion, for example, in Template Toolkit templates for
nexus/html/xml writing in an MVC context:
- datatype -- dna|rna|protein|standard|continuous|restriction
- symbols -- array ref of single character symbols
- missing -- the missing data symbol, usually '?' or 'N'
- gap -- the gap symbol, usually '-'
- ntax -- number of rows in matrix
- nchar -- number of columns in matrix
- charstatelabels -- column labels
- matrix -- raw matrix as two-dimensional array
New methods for data integrity:
- set_charstate_lookup -- set character state lookup hash
- get_charstate_lookup -- get character state lookup hash
Methods inherited from Bio::CDAT::ContainedObjectI. The idea is that
internally, the $cdat->add_matrix($matrix) method could check whether
$matrix->isa('Bio::CDAT::ContainedObjectI').
- get_cdat -- get the cdat container
- set_cdat -- set the cdat container
Methods inherited from Bio::Matrix::MatrixI. These methods are accessors
for generic matrices/tables:
- matrix_id -- primary key
- matrix_name -- string, nexus token legal, i.e. single quoted with
spaces or underline-separated
- get_entry($rowname,$columnname) -- get single cell identified by
$rowname and $columnname
- get_column($col) -- get column identified by $col
- get_row($row) -- get row identified by $row
- get_diagonal -- PROBABLY IRRELEVANT/IMPRACTICAL
- column_num_for_name($name) -- get internal index of column identified
by $name
- row_num_for_name($name) -- get internal index of row identified by $name
- num_rows -- SAME AS "ntax"
- num_columns -- SAME AS "nchar"
- row_names -- character sequence names
- column_names -- SAME AS "charstatelabels"
Methods like (but probably not inherited from)
Bio::Matrix::GenericMatrix. These methods are mutators for generic
matrices/tables:
- add_row($row) -- adds row $row to matrix
- remove_row($row) -- removes row $row from matrix
- add_column($col) -- adds column $col to matrix
- remove_column($col) -- removes column $col from matrix
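The accessor and mutator lists above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name My::CharMatrix, the constructor arguments, the add_row($name, @states) signature and the 0-based column index in get_entry are hypothetical and are not part of Bio::Phylo, BioPerl or any CDAT code.

```perl
use strict;
use warnings;

# Hypothetical in-memory matrix; names and signatures are illustrative only.
{
    package My::CharMatrix;

    sub new {
        my ( $class, %args ) = @_;
        my $self = {
            datatype => $args{datatype} || 'standard',
            missing  => defined $args{missing} ? $args{missing} : '?',
            gap      => defined $args{gap}     ? $args{gap}     : '-',
            rows     => {},    # row name => array ref of states
            order    => [],    # row names in insertion order
        };
        return bless $self, $class;
    }

    # Read-only accessors named after the nexus tokens.
    sub datatype { $_[0]{datatype} }
    sub missing  { $_[0]{missing} }
    sub gap      { $_[0]{gap} }
    sub ntax     { scalar @{ $_[0]{order} } }

    sub nchar {
        my $self = shift;
        my ($first) = @{ $self->{order} };
        return defined $first ? scalar @{ $self->{rows}{$first} } : 0;
    }

    # MatrixI-style aliases for the same values.
    sub num_rows    { $_[0]->ntax }
    sub num_columns { $_[0]->nchar }

    # Mutator in the spirit of add_row; rejects rows of the wrong length.
    sub add_row {
        my ( $self, $name, @states ) = @_;
        die "row '$name' has wrong length\n"
            if $self->ntax && @states != $self->nchar;
        push @{ $self->{order} }, $name unless exists $self->{rows}{$name};
        $self->{rows}{$name} = [@states];
        return $self;
    }

    # $col is a 0-based column index in this sketch, not a column name.
    sub get_entry {
        my ( $self, $row, $col ) = @_;
        return $self->{rows}{$row}[$col];
    }
}

my $matrix = My::CharMatrix->new( datatype => 'dna' );
$matrix->add_row( 'taxon_A', qw(A C G T) );
$matrix->add_row( 'taxon_B', qw(A C ? -) );
printf "ntax=%d nchar=%d datatype=%s\n",
    $matrix->ntax, $matrix->nchar, $matrix->datatype;
```

The token-named accessors and the num_rows/num_columns aliases return the same values, which is the redundancy the list above marks as SAME AS.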
Date: Thu, 13 Jul 2006 08:33:08 -0400
From: aaron.j.mackey@gsk.com
Subject: Re: character state matrix api
In-reply-to: <44B57153.1060600@sfu.ca>
Rutger, I'm excited to see someone laying out an API, but it would help me
(at least) to see your overall vision for class hierarchy and
relationships (perhaps in UML or something more lightweight), and then
start discussing API. In my comments/questions below, I'll try to infer
your underlying "design" from your external API description, but I may
have it wrong, so forgive me.
> New methods in Bio::Phylo::Matrices::CharMatrixI (future name:
> Bio::CDAT::CharMatrixI or Bio::Matrix::CharMatrixI). These methods are
> accessors - i.e. read only - that map onto the respective nexus tokens
> by the same name.
why read only? some of these I could imagine being "mutators" (e.g.
$cdat->missing("?") would cause all N's to convert to ?'s if
$cdat->missing() eq "N"). Also, if it's going to be read-only, then I
would prefer to see "get_*" accessor names, with explicitly missing
"set_*" mutator methods.
> * datatype -- dna|rna|protein|standard|continuous|restriction
the "original" CDAT relational data model allowed mixed datatypes (e.g. a
binary intron presence/absence state could be embedded in the CDS sequence
at the position at which the intron may occur); I realize that the
representation of this in Nexus flat file format must be via separate
matrix objects, but does that necessarily limit CDAT matrices?
> * symbols -- array ref of single character symbols
or maybe "alphabet"? related to the above, the symbols/alphabet structure
may differ per column if mixed datatypes are allowed ... so this could be
perhaps the "default" alphabet of the matrix, though a given column in the
matrix may utilize a different alphabet (or none at all, if continuous)
> * missing -- the missing data symbol, usually '?' or 'N'
> * gap -- the gap symbol, usually '-'
> * ntax -- number of rows in matrix
err, num_otus? or num_rows?
> * nchar -- number of columns in matrix
num_chars? num_cols? num_columns?
in general the prefix "n" is not very meaningful; one of the most loved
(and hated) aspects of BioPerl are the various method aliases that have
arisen for just these differences in style and expectation.
> * charstatelabels -- column labels
I'm not sure what this is; is this related to charsets? what if a column
belongs to more than one charset?
> * matrix -- raw matrix as two-dimensional array
>
> New methods for data integrity:
>
> * set_charstate_lookup -- set character state lookup hash
> * get_charstate_lookup -- get character state lookup hash
I don't know what a "character state lookup hash" is (well, I can guess,
but probably won't be entirely correct); what did you have in mind?
> Methods inherited from Bio::CDAT::ContainedObjectI. The idea is that
> internally, the $cdat->add_matrix($matrix) method could check whether
> $matrix->isa('Bio::CDAT::ContainedObjectI').
>
> * get_cdat -- get the cdat container
> * set_cdat -- set the cdat container
this seems like part of an inside-out design (which isn't necessarily bad,
I just want to make sure I understand the design); so instead of a
Bio::CDAT having matrices (or trees, or whatever else a CDAT contains),
you want the ability to "back reference" the CDAT object directly from the
matrix? Do you want to do this via soft-references (i.e. the Bio::CDAT
truly contains the matrices, and the matrices have a back-reference for
convenience), or does the Bio::CDAT truly not itself know what the
associated matrices are (not good, I'd think).
Also, doesn't this mean that I can't easily instantiate a plain old
Bio::Align::AlignI matrix and call $cdat->add_matrix($alignment) without
declaring Bio::Align::AlignI ISA Bio::CDAT::ContainedObjectI? This seems
unnecessarily prohibitive; I'd rather rebless the matrix into a new
derived subclass that contains the get_cdat/set_cdat methods, if those
backreferencing methods are so important to have.
> Methods inherited from Bio::Matrix::MatrixI. These methods are accessors
> for generic matrices/tables:
>
> * matrix_id -- primary key
> * matrix_name -- string, nexus token legal, i.e. single quoted with
> spaces or underline-separated
> * get_entry($rowname,$columnname) -- get single cell identified by
> $rowname and $columnname
for a CDAT, I'd like to see get_entry expanded to consider the notion of
charsets (i.e. give me the entry that corresponds to a particular position
in a charset, not the entire matrix).
> * get_column($col) -- get column identified by $col
ditto comment as above re: charsets
> * get_row($row) -- get row identified by $row
> * get_diagonal -- PROBABLY IRRELEVANT/IMPRACTICAL
agreed.
> * column_num_for_name($name) -- get internal index of column identified
> by $name
> * row_num_for_name($name) -- get internal index of row identified by
> $name
> * num_rows -- SAME AS "ntax"
> * num_columns -- SAME AS "nchar"
OK, so these are your stylistic aliases
> * row_names -- character sequence names
> * column_names -- SAME AS "charstatelabels"
>
> Methods like (but probably not inherited from)
> Bio::Matrix::GenericMatrix. These methods are mutators for generic
> matrices/tables:
>
> * add_row($row) -- adds row $row to matrix
> * remove_row($row) -- removes row $row from matrix
> * add_column($col) -- adds column $col to matrix
> * remove_column($col) -- removes column $col from matrix
here's where the fun starts; what happens if you execute these methods on
a matrix already associated with a CDAT, and that CDAT already has
associated tree(s)?
I think the "rebless into CDAT-aware subclass" idea almost has to happen
to be able to intercept these calls and either a) try to cascade the
action if (easily) possible or b) throw a consistency error.
Thus, we'll also need a way to disassociate a matrix from its CDAT object
to "restore" its normal base functionality.
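A minimal sketch of the "rebless into CDAT-aware subclass" idea, taking option (b) and throwing a consistency error. The package names My::PlainMatrix and My::CDATAwareMatrix, the attach/detach methods and the use of a string ID as back-reference are all hypothetical; nothing here is existing BioPerl or CDAT API.

```perl
use strict;
use warnings;

# Stand-in for a plain matrix class that knows nothing about CDAT.
{
    package My::PlainMatrix;
    sub new        { return bless { rows => {} }, shift }
    sub add_row    { my ( $s, $name, $row ) = @_; $s->{rows}{$name} = $row; return $s }
    sub remove_row { my ( $s, $name ) = @_; return delete $s->{rows}{$name} }
    sub row_names  { return sort keys %{ $_[0]{rows} } }
}

# Hypothetical CDAT-aware subclass; objects are reblessed into it on attach.
{
    package My::CDATAwareMatrix;
    our @ISA = ('My::PlainMatrix');

    # Remember the original class, store the CDAT id, then rebless.
    sub attach {
        my ( $class, $matrix, $cdat_id ) = @_;
        $matrix->{_orig_class} = ref $matrix;
        $matrix->{_cdat_id}    = $cdat_id;
        return bless $matrix, $class;
    }

    # Undo attach: drop the bookkeeping and restore the original class.
    sub detach {
        my $self = shift;
        my $orig = delete $self->{_orig_class};
        delete $self->{_cdat_id};
        return bless $self, $orig;
    }

    sub get_cdat { return $_[0]{_cdat_id} }

    # Intercepted mutator: refuse rather than silently invalidate a tree.
    sub remove_row {
        my ( $self, $name ) = @_;
        die "consistency error: cannot remove row '$name' while attached to "
            . $self->get_cdat . "\n";
    }
}

my $m = My::PlainMatrix->new;
$m->add_row( taxon_A => [qw(A C G T)] );

My::CDATAwareMatrix->attach( $m, 'cdat_1' );
my $ok = eval { $m->remove_row('taxon_A'); 1 };
print "blocked: $@" unless $ok;

$m->detach;                    # back to My::PlainMatrix
$m->remove_row('taxon_A');     # plain behaviour is restored
```

Cascading the removal to an associated tree (option a) would replace the die with a call into the container; that part is left out because the thread has not settled how trees and matrices are linked.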
Thanks for the solid thinking, hopefully this discourse remains fruitful.
-Aaron
Date: Thu, 13 Jul 2006 13:54:14 -0700, From: Rutger Vos <rvosa@sfu.ca>
Date: Thu, 13 Jul 2006 13:54:14 -0700
From: Rutger Vos <rvosa@sfu.ca>
Subject: Re: character state matrix api
Hi Aaron,
thanks for the reply! Below I will go through the points you're making
one by one trying to explain myself a bit more, but here I'll say a bit
more about "the vision thing".
I, like everyone else, like APIs that are sensibly laid out and easy to
remember. I agree that accessors and mutators should be explicitly
separated as get_* and set_* methods. BioPerl sometimes violates this,
by using variable argument lists, such as $node->branch_length( $length
) and $node->branch_length() for setting and getting, respectively. We
can't really change these APIs, they've ossified, they're "stable".
For our problem space, we also have another type of ossified API, namely
that of the nexus syntax. For character state matrices, there's a bunch
of tokens that many phylogeneticists can recite: 'datatype', 'ntax',
'nchar', 'symbols', 'missing', etc.
If I try to imagine how people would use a shared API we design, I can
see many people wanting to use a parser library from one package to
obtain objects from a file, and serialize it in some way - to the cipres
architecture, to a different data file format, to a visualization
format, to another internal data structure.
The things I will want to know about an object once I've received it
from a parser and query it for serialization will probably be the same
things the programs for which nexus was designed (and those that have
adopted it subsequently) want to know: how many rows, how many columns,
what sort of symbols can I expect and what do they mean, what do the
rows mean. The tokens to indicate that in nexus ('ntax', 'nchar',
'symbols', 'datatype', an implicit or explicit 'link' to a 'taxa' block)
are in my view part of the "traditional", stable terminology. I think
the API we design will suffer if we replace these with
long-but-consistent names that will be soul destroying to type out every
time ('get_num_rows', 'get_num_columns', 'get_matrix_symbols',
'get_matrix_data_type' etc.).
We are, after all, talking about perl programming, and Perl has a 'grep'
function.
There is already some friction between these two forces (consistency
versus convention), but there is a third force acting on the design:
having to fit into BioPerl's inheritance tree. The way I see it, we
could fit in like this:
Matrices
Bio::CDAT::CharMatrixI (now under discussion) would be the main
interface for character state matrices. Ideally, this would be an
interface that is relatively easy to implement in NEXPL and Bio::Phylo,
so that either can function as a parser back end for Bio::CDAT::IO. The
Bio::CDAT::CharMatrixI interface inherits from BioPerl's
Bio::Matrix::MatrixI so that objects from the CDAT parser architecture
are available to other BioPerl modules that want matrices.
The matrix would be comprised of character sequence objects (basically,
an encapsulated matrix row), for which we probably need a
Bio::CDAT::CharSeqI interface.
Trees
We can use Bio::Tree::TreeI, which should be fairly easy to implement in
NEXPL. Likewise, for nodes we can use Bio::Tree::NodeI, which NEXPL
would have to implement. This would make the tree parsers available for
the IO back end.
Taxa
There needs to be some notion like the 'taxa' block in nexus files.
Taxon objects are basically encapsulated names to which sequences and
nodes can link in some way for disambiguation purposes.
So that's the direction in which I'm thinking. Now, specifically:
aaron.j.mackey@gsk.com wrote:
> Rutger, I'm excited to see someone laying out an API, but it would help me
> (at least) to see your overall vision for class hierarchy and
> relationships (perhaps in UML or something more lightweight), and then
> start discussing API. In my comments/questions below, I'll try to infer
> your underlying "design" from your external API description, but I may
> have it wrong, so forgive me.
>
> > New methods in Bio::Phylo::Matrices::CharMatrixI (future name:
> > Bio::CDAT::CharMatrixI or Bio::Matrix::CharMatrixI). These methods are
> > accessors - i.e. read only - that map onto the respective nexus tokens
> > by the same name.
>
> why read only? some of these I could imagine being "mutators" (e.g.
> $cdat->missing("?") would cause all N's to convert to ?'s if
> $cdat->missing() eq "N"). Also, if it's going to be read-only, then I
> would prefer to see "get_*" accessor names, with explicitly missing
> "set_*" mutator methods.
I'm on the fence here - I'd like to strike a balance between
"consistent" and "easy to remember". In the context of stringifying a
matrix to nexus (or another conceptually similar format) it'd be nice to
have all the nexus tokens available. For example, in a template for the
template toolkit, you could do:
########################
begin characters;
dimensions ntax=[% matrix.ntax %] nchar=[% matrix.nchar %];
format datatype=[% matrix.datatype %] missing=[% matrix.missing %]
gap=[% matrix.gap %] symbols=[% matrix.symbols %];
charlabels [% matrix.charlabels %];
matrix
....
########################
...if you pass it a $matrix that is a CharMatrixI object. On the other
hand, for Bio::Phylo I have religiously stuck to get_* and set_*
methods, and if any of the above methods are "setters" too, that should
be made explicit in the method names (rather than through 'dual usage'
overloading, i.e. with/without arg). I would just hate to have to type
get_num_taxa every time I want to know how many rows are in the matrix.

> > * datatype -- dna|rna|protein|standard|continuous|restriction
>
> the "original" CDAT relational data model allowed mixed datatypes (e.g. a
> binary intron presence/absence state could be embedded in the CDS sequence
> at the position at which the intron may occur); I realize that the
> representation of this in Nexus flat file format must be via separate
> matrix objects, but does that necessarily limit CDAT matrices?
I forgot "mixed". It should be allowed. Note that nexus files for
mrbayes have a mixed data type (essentially concatenated matrices of dna
and standard, though).
> > * symbols -- array ref of single character symbols
>
> or maybe "alphabet"? related to the above, the symbols/alphabet structure
> may differ per column if mixed datatypes are allowed ... so this could be
> perhaps the "default" alphabet of the matrix, though a given column in the
> matrix may utilize a different alphabet (or none at all, if continuous)
Again, this was for consistency with nexus tokens, as are the names below:
> > * missing -- the missing data symbol, usually '?' or 'N'
> > * gap -- the gap symbol, usually '-'
> > * ntax -- number of rows in matrix
>
> err, num_otus? or num_rows?
>
> > * nchar -- number of columns in matrix
>
> num_chars? num_cols? num_columns?
>
> in general the prefix "n" is not very meaningful; one of the most loved
> (and hated) aspects of BioPerl are the various method aliases that have
> arisen for just these differences in style and expectation.
>
> > * charstatelabels -- column labels
>
> I'm not sure what this is; is this related to charset's? what if a column
> belongs to more than one charset?
I meant the "charlabels" nexus token (i.e. column names).
> > New methods for data integrity:
> >
> > * set_charstate_lookup -- set character state lookup hash
> > * get_charstate_lookup -- get character state lookup hash
>
> I don't know what a "character state lookup hash" is (well, I can guess,
> but probably won't be entirely correct); what did you have in mind?
We need to be able to specify how the different symbols in a matrix map
onto each other. For example, for restriction data, state '0' only ever
maps onto '0', and '1' maps onto '1', i.e. both are unambiguous symbols.
The '?' symbol could mean either '0' or '1'; the '-' symbol means
neither. A hash that describes this is:
    my $lookup = {
        '-' => [],
        '0' => [ '0' ],
        '1' => [ '1' ],
        '?' => [ '0', '1' ],
    };
It indicates what we actually mean w.r.t. "missing" data and "gaps".
For this data type this is not very complex, but think of how the IUPAC
single character ambiguity symbols map onto each other: a rather bigger
hash. For all datatypes (other than continuous) we can define a default
hash in the classes, and users can get and set a new one, perhaps
merging default hashes from different data types for "mixed" matrices.
Here's why we need this: i) symbols can be validated by checking whether
they exist as keys in the hash; ii) if, while parsing a matrix, you come
across "{ac}" (mrbayes) or "a&c" (mesquite) you can lookup the symbol
that maps onto [ 'A', 'C' ] and use that internally; iii) by
implementing an internal notion of ambiguity we can write out different
dialects of nexus, i.e. with the {ac} construct if we're writing for
mrbayes, and a&c if we're writing for mesquite; iv) the hashes can be
modified/merged - e.g. for "mixed" data we could specify a hash that
combines the "dna" and "standard" hashes.
(Mesquite and paup do things internally like this as well, albeit with
some multidimensional array jiggery-pokery.)
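Points i), ii) and iv) above can be sketched in a few lines. The function names is_valid_symbol, symbol_for_states and merge_lookups are hypothetical, and the DNA hash is only a three-symbol fragment of the IUPAC table (M stands for A or C), not a complete default.

```perl
use strict;
use warnings;

# Lookup for restriction data, as in the message above.
my $lookup = {
    '-' => [],
    '0' => ['0'],
    '1' => ['1'],
    '?' => [ '0', '1' ],
};

# Deliberately partial DNA lookup, for illustration only.
my $dna_fragment = {
    'A' => ['A'],
    'C' => ['C'],
    'M' => [ 'A', 'C' ],
};

# (i) A symbol is valid if it exists as a key in the lookup.
sub is_valid_symbol {
    my ( $table, $symbol ) = @_;
    return exists $table->{$symbol} ? 1 : 0;
}

# (ii) Find the single symbol that stands for a set of states, e.g. the
# states parsed out of "{ac}" or "a&c". Returns nothing if there is none.
sub symbol_for_states {
    my ( $table, @states ) = @_;
    my $wanted = join( ',', sort @states );
    for my $symbol ( sort keys %$table ) {
        my $have = join( ',', sort @{ $table->{$symbol} } );
        return $symbol if $have eq $wanted;
    }
    return;
}

# (iv) Merge lookups for "mixed" data; later tables win on key clashes.
sub merge_lookups {
    my %merged;
    for my $table (@_) {
        %merged = ( %merged, %$table );
    }
    return \%merged;
}

my $mixed = merge_lookups( $dna_fragment, $lookup );
print "valid\n" if is_valid_symbol( $mixed, 'M' );
print symbol_for_states( $dna_fragment, 'C', 'A' ), "\n";
print symbol_for_states( $lookup, '1', '0' ), "\n";
```

Writing a dialect (point iii) would be the reverse step: look up the states for a symbol and join them as "{ac}" or "a&c" depending on the target program.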
> > Methods inherited from Bio::CDAT::ContainedObjectI. The idea is that
> > internally, the $cdat->add_matrix($matrix) method could check whether
> > $matrix->isa('Bio::CDAT::ContainedObjectI').
> >
> > * get_cdat -- get the cdat container
> > * set_cdat -- set the cdat container
>
> this seems like part of an inside-out design (which isn't necessarily bad,
> I just want to make sure I understand the design); so instead of a
> Bio::CDAT having matrices (or trees, or whatever else a CDAT contains),
> you want the ability to "back reference" the CDAT object directly from the
> matrix? Do you want to do this via soft-references (i.e. the Bio::CDAT
> truly contains the matrices, and the matrices have a back-reference for
> convenience), or does the Bio::CDAT truly not itself know what the
> associated matrices are (not good, I'd think).
I think $node needs to be able to find out whether $charseq belongs to
the same Bio::CDAT container. The Bio::CDAT container would be some kind
of array, so it could get at its contents and know what the associated
matrices/trees/etc are, but it'll be handy if the contained objects can
get at their container also. Perhaps just via their ID, not via actual
references - as you suggested earlier (also to prevent issues with
cyclical references and memory leaks, I realized later).
> Also, doesn't this mean that I can't easily instantiate a plain old
> Bio::Align::AlignI matrix and call $cdat->add_matrix($alignment) without
> declaring Bio::Align::AlignI ISA Bio::CDAT::ContainedObjectI? This seems
> unnecessarily prohibitive; I'd rather rebless the matrix into a new
> derived subclass that contains the get_cdat/set_cdat methods, if those
> backreferencing methods are so important to have.
Sure, I can't think of any name clashes right now, so objects contained
by Bio::CDAT could perhaps be duck-typed by $obj->can('set_cdat'). Part
of the point was that the CDAT container should be able to figure out
whether what you're trying to add to it is a good idea or not, without a
cascade of if/else statements.
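A minimal sketch of the duck-typing check, with the back-reference stored as an ID rather than a real reference, as suggested earlier in this message. The packages My::CDAT, My::Matrix and My::Unrelated are hypothetical stand-ins.

```perl
use strict;
use warnings;

# Hypothetical container; accepts anything that can('set_cdat').
{
    package My::CDAT;
    use Scalar::Util qw(blessed);

    sub new { return bless { id => $_[1], matrices => [] }, $_[0] }

    sub add_matrix {
        my ( $self, $matrix ) = @_;
        die "not a CDAT-ready object\n"
            unless blessed($matrix) && $matrix->can('set_cdat');
        $matrix->set_cdat( $self->{id} );    # store the ID, not a reference
        push @{ $self->{matrices} }, $matrix;
        return $self;
    }
}

# A class that happens to provide the back-reference methods.
{
    package My::Matrix;
    sub new      { return bless {}, shift }
    sub set_cdat { $_[0]{cdat_id} = $_[1]; return $_[0] }
    sub get_cdat { return $_[0]{cdat_id} }
}

# A class that does not.
{
    package My::Unrelated;
    sub new { return bless {}, shift }
}

my $cdat   = My::CDAT->new('cdat_1');
my $matrix = My::Matrix->new;
$cdat->add_matrix($matrix);

my $accepted = eval { $cdat->add_matrix( My::Unrelated->new ); 1 };

print "matrix belongs to ", $matrix->get_cdat, "\n";
print "unrelated object rejected\n" unless $accepted;
```

Because only an ID is stored in the matrix, there is no reference cycle between container and contents, at the cost of needing some registry to turn the ID back into an object.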
> > Methods inherited from Bio::Matrix::MatrixI. These methods are accessors
> > for generic matrices/tables:
> >
> > * matrix_id -- primary key
> > * matrix_name -- string, nexus token legal, i.e. single quoted with
> > spaces or underline-separated
> > * get_entry($rowname,$columnname) -- get single cell identified by
> > $rowname and $columnname
>
> for a CDAT, I'd like to see get_entry expanded to consider the notion of
> charsets (i.e. give me the entry that corresponds to a particular position
> in a charset, not the entire matrix).
Sounds good. Could be a third positional argument, I guess?
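One possible reading of the third positional argument, as a sketch only: a charset is assumed to be a named list of 0-based column indices, and the column argument is then interpreted as a position within that list. The data, the charset name and this index convention are all assumptions.

```perl
use strict;
use warnings;

# Toy data: row name => array ref of states.
my %rows = (
    taxon_A => [qw(A C G T A C)],
    taxon_B => [qw(A C G A A T)],
);

# Hypothetical charsets: name => 0-based column indices in the full matrix.
my %charsets = ( third_positions => [ 2, 5 ] );

# get_entry( $row, $col )           -> cell in the full matrix
# get_entry( $row, $pos, $charset ) -> cell at position $pos of the charset
sub get_entry {
    my ( $row, $col, $charset ) = @_;
    die "unknown row '$row'\n" unless exists $rows{$row};
    if ( defined $charset ) {
        my $columns = $charsets{$charset}
            or die "unknown charset '$charset'\n";
        die "position $col is outside charset '$charset'\n"
            if $col < 0 || $col > $#$columns;
        $col = $columns->[$col];    # translate to a matrix column
    }
    return $rows{$row}[$col];
}

print get_entry( 'taxon_B', 1 ), "\n";                       # full matrix
print get_entry( 'taxon_B', 1, 'third_positions' ), "\n";    # within charset
```

The same translation step would apply to get_column, which the earlier message asks to be charset-aware as well.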
> > * num_rows -- SAME AS "ntax"
> > * num_columns -- SAME AS "nchar"
>
> OK, so these are your stylistic aliases
Yup, there'll inevitably be some redundancy/aliasing.
> > Methods like (but probably not inherited from)
> > Bio::Matrix::GenericMatrix. These methods are mutators for generic
> > matrices/tables:
> >
> > * add_row($row) -- adds row $row to matrix
> > * remove_row($row) -- removes row $row from matrix
> > * add_column($col) -- adds column $col to matrix
> > * remove_column($col) -- removes column $col from matrix
>
> here's where the fun starts; what happens if you execute these methods on
> a matrix already associated with a CDAT, and that CDAT already has
> associated tree(s)?
Adding columns I can't see having a great effect on associated trees,
but here's how things work inside Bio::Phylo:
If you add a row, that row is a datum object that is either identified
by a name (string) or a taxon object. The taxon object is contained by a
taxa container. If you insert the datum object in the matrix object, the
matrix will check whether the datum object holds a reference to a taxon,
and if it does, whether it belongs to the right taxa container. Matrices
and trees can both reference the same taxa container, so that you get an
architecture like in a nexus file.
> I think the "rebless into CDAT-aware subclass" idea almost has to happen
> to be able to intercept these calls and either a) try to cascade the
> action if (easily) possible or b) throw a consistency error.
>
> Thus, we'll also need a way to disassociate a matrix from its CDAT object
> to "restore" its normal base functionality.
Agreed.
> Thanks for the solid thinking, hopefully this discourse remains fruitful.
>
> -Aaron
Any further feedback very welcome.
Best wishes,
Rutger
Date: Sat, 15 Jul 2006 13:19:35 -0400, From: aaron.j.mackey@gsk.com
Date: Sat, 15 Jul 2006 13:19:35 -0400
From: aaron.j.mackey@gsk.com
Subject: Re: character state matrix api
> For character state matrices, there's a bunch
> of tokens that many phylogeneticists can recite: 'datatype', 'ntax',
> 'nchar', 'symbols', 'missing', etc.
Fair enough, I'm convinced.
> the API we design will suffer if we replace these with
> long-but-consistent names that will be soul destroying to type out every
> time ('get_num_rows', 'get_num_columns', 'get_matrix_symbols',
> 'get_matrix_data_type' etc.).
yep, agreed.
> _Taxa_
> There needs to be some notion like the 'taxa' block in nexus files.
> Taxon objects are basically encapsulated names to which sequences and
> nodes can link in some way for disambiguation purposes.
OK, this makes more sense to me now; I was still thinking about the
original CDAT relational model, that had no separate Taxa/OTU entity, only
sequences and nodes directly linked. but having a separate taxa object
will make things more flexible.
> For example, in a template for the template toolkit, you could do:
> ########################
> begin characters;
> dimensions ntax=[% matrix.ntax %] nchar=[% matrix.nchar %];
> format datatype=[% matrix.datatype %] missing=[% matrix.missing %]
> gap=[% matrix.gap %] symbols=[% matrix.symbols %];
> charlabels [% matrix.charlabels %];
> matrix
> ....
> ########################
Yes, this is nice, but I worry about defining an API based on one
particular file format data-structure.
Besides, in your example, how do [% matrix.symbols %] and
matrix.charlabels interpolate into the file, with quotes, commas,
whitespace, etc? I'm not concerned about data input/output formats being
so tightly bound to the API.
> >> * charstatelabels -- column labels
> I meant the "charlabels" nexus token (i.e. column names).
I guess I'm not familiar with this; is this just the numbers 1 to [nchar]?
> >> * set_charstate_lookup -- set character state lookup hash
> >> * get_charstate_lookup -- get character state lookup hash
>
> We need to be able to specify how the different symbols in a matrix map
> onto each other. For example, for restriction data, state '0' only ever
> maps onto '0', and '1' maps onto '1', i.e. both are unambiguous symbols.
> The '?' symbol could mean either '0' or '1'; the '-' symbol means
> neither. A hash that describes this is:
>
>     my $lookup = {
>         '-' => [],
>         '0' => [ '0' ],
>         '1' => [ '1' ],
>         '?' => [ '0', '1' ],
>     };
>
> Here's why we need this: i) symbols can be validated by checking whether
> they exist as keys in the hash; ii) if, while parsing a matrix, you come
> across "{ac}" (mrbayes) or "a&c" (mesquite) you can lookup the symbol
> that maps onto [ 'A', 'C' ] and use that internally;
ahh, this is presumably because you require each "cell" in the matrix to
be scalar. again, one of the goals of CDAT is to go beyond this
simplifying assumption and allow states to be probabilistic across the
defined alphabet, for both observed states and inferred ancestral states.
Why would an observed state be probabilistic? For exactly the "ambiguity"
reasons you define above, and others (sequence trace/assembly quality
scores, suspect mutations, etc.). This is particularly relevant for
applications such as SNPs where you might want to remember the fraction of
the population that has this vs. that allele at a particular position.
So if we don't do this internal "translation", we don't need these hashes;
for validation all we need is the alphabet/symbols method (specified
earlier).
> (Mesquite and paup do things internally like this as well, albeit with
> some multidimensional array jiggery-pokery.)
yep, I think that jiggery-pokery may be in our game-plan as well.
Of course, this may be a schism point between a Bio::CDAT::MatrixI and a
Bio::CDAT::ProbabilisticMatrixI (which ISA Bio::CDAT::MatrixI), which
would also be fine.
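To make the probabilistic-cell idea concrete, here is a minimal sketch in which a cell is a hash of state => probability over the alphabet instead of a single scalar symbol. The function names cell_from_states and is_valid_cell, the uniform spreading of ambiguity and the example frequencies are all assumptions for illustration.

```perl
use strict;
use warnings;

my @alphabet = qw(A C G T);

# Build a cell from a set of equally likely states, e.g. the ambiguity
# "{ac}" becomes A => 0.5, C => 0.5 and zero elsewhere.
sub cell_from_states {
    my @states = @_;
    die "need at least one state\n" unless @states;
    my %cell;
    @cell{@alphabet} = (0) x @alphabet;
    my $p = 1 / scalar @states;
    $cell{$_} = $p for @states;
    return \%cell;
}

# Validation needs only the alphabet: every key must be a known symbol,
# no probability may be negative, and the values must sum to one.
sub is_valid_cell {
    my ($cell) = @_;
    my %known;
    @known{@alphabet} = ();
    my $total = 0;
    for my $state ( keys %$cell ) {
        return 0 unless exists $known{$state};
        return 0 if $cell->{$state} < 0;
        $total += $cell->{$state};
    }
    return abs( $total - 1 ) < 1e-9 ? 1 : 0;
}

my $ambiguous = cell_from_states( 'A', 'C' );

# A SNP-style cell: made-up allele frequencies in a population.
my $snp = { A => 0.8, G => 0.2 };

for my $cell ( $ambiguous, $snp ) {
    print join( ' ', map { "$_=$cell->{$_}" } sort keys %$cell ),
        is_valid_cell($cell) ? " (valid)\n" : " (invalid)\n";
}
```

An unambiguous observation is then just the special case with probability 1 on a single state, which is how a scalar-cell matrix could be viewed as a restricted probabilistic one.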
> >> Methods inherited from Bio::CDAT::ContainedObjectI. The idea is that
> >> internally, the $cdat->add_matrix($matrix) method could check whether
> >> $matrix->isa('Bio::CDAT::ContainedObjectI').
> >>
> >> * get_cdat -- get the cdat container
> >> * set_cdat -- set the cdat container
>
> I think $node needs to be able to find out whether $charseq belongs to
> the same Bio::CDAT container.
I think that's fine, for utility/sanity checking.
> Sure, I can't think of any name clashes right now, so objects contained
> by Bio::CDAT could perhaps be duck-typed by $obj->can('set_cdat'). Part
> of the point was that the CDAT container should be able to figure out
> whether what you're trying to add to it is a good idea or not, without a
> cascade of if/else statements.
I see your point, but I fear any solution that requires some other
monolithic project (BioPerl) to add interfaces and/or methods to support
another arcane (though useful) project (Bio::CDAT).
> >> * add_row($row) -- adds row $row to matrix
> >> * remove_row($row) -- removes row $row from matrix
> >> * add_column($col) -- adds column $col to matrix
> >> * remove_column($col) -- removes column $col from matrix
>
> > here's where the fun starts; what happens if you execute these methods on
> > a matrix already associated with a CDAT, and that CDAT already has
> > associated tree(s)?
>
> Adding columns I can't see having a great effect on associated trees,
except that if the tree was inferred from the matrix, adding a new column
negates/outdates the current inference.
> but here's how things work inside Bio::Phylo:
> If you add a row, that row is a datum object that is either identified
> by a name (string) or a taxon object. The taxon object is contained by a
> taxa container. If you insert the datum object in the matrix object, the
> matrix will check whether the datum object holds a reference to a taxon,
> and if it does, whether it belongs to the right taxa container. Matrices
> and trees can both reference the same taxa container, so that you get an
> architecture like in a nexus file.
OK, again you're thinking about discrete manipulations of the
datastructure (which is good), but I'm thinking about possible utility. If
I call remove_row() will that "cascade" to an equivalent remove_node()
call in the associated tree? I guess in my head there are two "scenarios"
under which data manipulation occurs: construction (in which I don't care
so much about referential integrity until I'm all done) and analysis (in
which I do care very much if I stupidly do something that
invalidates/outdates some other piece of information I've also carefully
constructed).
> > I think the "rebless into CDAT-aware subclass" idea almost has to happen
> > to be able to intercept these calls and either a) try to cascade the
> > action if (easily) possible or b) throw a consistency error.
One further thought on this is that we might consider using an Observer
design pattern to do this: one (or more?) Bio::CDAT objects are registered
as listeners to the events that occur on Bio::CDAT::ComponentI's (instead
of ContainedObjectI's); thus the component gets to know its CDAT(s), the
CDAT(s) gets to control (via callbacks) its components (and interfere when
something bad happens), etc. We'd still need to be able to directly
access components via the CDAT object, so some amount of cyclic
referencing will be necessary, but weak referencing is pretty stable in
Perl nowadays.
>> Thus, we'll also need a way to disassociate a matrix from its CDAT object
>> to "restore" its normal base functionality.
With the Observer pattern, this is simply a de-registration, no need to
un/re-bless.
-Aaron
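A minimal sketch of the Observer arrangement described above, under stated assumptions: the class names My::Component and My::CDAT and the method names register, unregister, notify and update are hypothetical and are not an existing Bio::CDAT API. The component keeps only weak references to its listeners, so the CDAT can hold its components directly without creating a memory leak, and a listener can veto a change by throwing an error before it is applied.

```perl
use strict;
use warnings;

package My::Component;    # hypothetical stand-in for a Bio::CDAT::ComponentI
use Scalar::Util qw(weaken refaddr);

sub new { my $class = shift; return bless { rows => [@_], listeners => {} }, $class }

# Register a listener; the stored reference is weakened so that the
# CDAT -> component -> CDAT cycle does not keep either object alive.
sub register {
    my ( $self, $listener ) = @_;
    my $id = refaddr($listener);
    $self->{listeners}{$id} = $listener;
    weaken( $self->{listeners}{$id} );
    return $self;
}

# De-registration is all it takes to restore "normal" behaviour.
sub unregister {
    my ( $self, $listener ) = @_;
    delete $self->{listeners}{ refaddr($listener) };
    return $self;
}

# Tell every listener that is still alive about an event.
sub notify {
    my ( $self, $event, @args ) = @_;
    for my $listener ( grep { defined $_ } values %{ $self->{listeners} } ) {
        $listener->update( $self, $event, @args );
    }
    return;
}

sub remove_row {
    my ( $self, $name ) = @_;
    $self->notify( 'remove_row', $name );    # listeners may die to veto
    $self->{rows} = [ grep { $_ ne $name } @{ $self->{rows} } ];
    return $self;
}

package My::CDAT;    # hypothetical observer

sub new { my $class = shift; return bless { log => [], locked => 0 }, $class }

# Callback: record the event, or throw a consistency error when locked.
sub update {
    my ( $self, $component, $event, @args ) = @_;
    die "consistency error: $event(@args) not allowed\n" if $self->{locked};
    push @{ $self->{log} }, "$event(@args)";
    return;
}

package main;

my $cdat   = My::CDAT->new;
my $matrix = My::Component->new(qw(human chimp gorilla));
$matrix->register($cdat);
$matrix->remove_row('chimp');      # observed by $cdat
$matrix->unregister($cdat);
$matrix->remove_row('gorilla');    # no longer observed
print "@{ $cdat->{log} }\n";       # the events the CDAT saw
```

Whether a cascade (e.g. an equivalent remove_node() on an associated tree) or a consistency error is the right reaction is left to the update() callback; the sketch only shows where that decision would live.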
Date: Mon, 17 Jul 2006 18:06:16 -0700, From: Rutger Vos <rvosa@sfu.ca>
Date: Mon, 17 Jul 2006 18:06:16 -0700
From: Rutger Vos <rvosa@sfu.ca>
Subject: Re: character state matrix api
Hi all,
in my head, I've summarized the exchanges on this thread as:
- we need a character state matrix api, AlignI is too specialized,
MatrixI too general;
- there is some debate about the exact aesthetics of getters and setters;
- the matrix will have to be able to maintain type safety, and hold
meta data about ambiguity/uncertainty;
- the matrix needs to fit into a more general CDAT architecture for
naming, i.e. how do we know that a sequence and a node refer to the same
entity?
- the matrix needs to fit into an architecture for maintaining
referential integrity, so that state changes can cascade from one
CDAT-contained object to another.
To start with the last item: Aaron mentioned the observer pattern. If I
understand correctly, in this case it would mean that the main CDAT
object is the observer, and that matrices, trees, nodes (etc.?) are
subjects that register with the observer, and notify it when they change
state (but please chime in if that's not how it would work). I think we
can also see this in a caching context, i.e. in the implementation,
objects might store intermediate calculation results and keep them until
the object changes state (perhaps through a call by the CDAT object?).
On the second-to-last item: perhaps the CDAT object maintains a pool of
all the unique names/ids of all entities that it observes? Earlier
upthread we talked about whether a cdat object was essentially { _trees
=> [], _matrices => [] }, but I think we were worried we'd end up with
the cdat object becoming a big soup of unrelated entities. If we've added a
matrix to the cdat object, surely we will want to map in some way the
names of sequences in that matrix onto matching names in a tree we
subsequently add. Maybe CDAT is something like a hash with names as keys
and as values some structure to define the objects by that name (perhaps
holding object references or ids).
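To make the "hash with names as keys" idea concrete, here is a minimal sketch. The functions register_name and unmatched and the layout of %registry are assumptions made for illustration only; plain strings stand in for the object references or ids that a real CDAT would store.

```perl
use strict;
use warnings;

# Hypothetical name pool: one entry per entity name, listing every
# component (matrix row, tree node, ...) known under that name.
my %registry;

sub register_name {
    my ( $name, $kind, $object ) = @_;
    push @{ $registry{$name}{$kind} }, $object;
    return;
}

# Names that occur in one kind of component but not in another,
# e.g. sequences in a matrix that have no matching tip in a tree.
sub unmatched {
    my ( $have, $lack ) = @_;
    my @names = grep { $registry{$_}{$have} && !$registry{$_}{$lack} } keys %registry;
    return sort @names;
}

register_name( $_, 'row',  "matrix1 row $_" ) for qw(human chimp gorilla);
register_name( $_, 'node', "tree1 tip $_" )   for qw(human chimp);

# Rows that could not be mapped onto the tree added afterwards.
print join( ', ', unmatched( 'row', 'node' ) ), "\n";
```

With such a pool, adding a tree after a matrix becomes a lookup by name rather than a search through unrelated containers.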
W.r.t. type safety and metadata: I am worried about speed/memory
requirements - can we just relegate this to either the character
sequence object or a matrix subclass?
Rutger
(Below are some more specific responses.)
> Yes, this is nice, but I worry about defining an API based on one
> particular file format data-structure.
There are so many nexus files out there, I think many people will be
very happy if their contents can be made available through a simple API.
I agree, though, that this may not necessarily be provided by 'core'
cdat - perhaps by nexpl? It would just be aliasing the getters by the
same name, I guess.
> Besides, in your example, how does [% matrix.symbols %] and
> matrix.charlabels interpolate into the file, with quotes, commas,
> whitespace, etc? I'm not concerned about data input/output formats being
> so tightly bound to the API.
array references are interpolated into a space separated string of their
contents. I think symbols need to be double quoted in nexus, so there'd
have to be quotes around that in the template. Charlabels aren't quoted.
>>>> * charstatelabels -- column labels
>>>>
>> I meant the "charlabels" nexus token (i.e. column names).
>
> I guess I'm not familiar with this; is this just the numbers 1 to [nchar]?
It's like taxlabels: a list of names. In molecular matrices they are
often skipped.
> ahh, this is presumably because you require each "cell" in the matrix to
> be scalar. again, one of the goals of CDAT is to go beyond this
> simplifying assumption and allow states to be probabilistic across the
> defined alphabet, for both observed states and inferred ancestral states.
Individual cells would have to be pretty compressed one way or another.
The big idea here was that this is done by mapping all ambiguity onto
single character symbols. Another way that I know of is by packing the
states into bitvectors, e.g. A: 1000, C: 0100, G: 0010, T: 0001, N:
1111, (which may or may not be surprisingly efficient depending on how
perl rounds bytes). Whatever it is, I don't see every "cell" become a
complex data structure, not to mention an object. It'll be impossible in
terms of memory requirements. A matrix for 200 taxa, 2000 characters?
It'll take ages to parse.
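As a rough illustration of the bitvector idea, the following sketch packs one cell into four bits using Perl's built-in vec(). The function names pack_row and cell are hypothetical. At four bits per cell, the 200 taxa by 2000 characters matrix mentioned above would occupy 200,000 bytes of packed data; the sketch covers categorical state sets only and says nothing about probabilistic states.

```perl
use strict;
use warnings;

# One bit per base, following the encoding given above (A: 1000 ... T: 0001).
my %bits = ( A => 0b1000, C => 0b0100, G => 0b0010, T => 0b0001, N => 0b1111 );

# Pack a sequence into a string, four bits (half a byte) per cell.
sub pack_row {
    my ($seq) = @_;
    my $packed = '';
    my $i      = 0;
    for my $symbol ( split //, uc $seq ) {
        vec( $packed, $i++, 4 ) = $bits{$symbol};
    }
    return $packed;
}

# Read back the state set of a single cell as a four-bit number.
sub cell {
    my ( $packed, $col ) = @_;
    return vec( $packed, $col, 4 );
}

my $row = pack_row('ACGTN');
printf "%d cells in %d bytes\n", 5, length($row);
print "cell 4 is fully ambiguous\n" if cell( $row, 4 ) == $bits{N};
print "cell 1 may be C\n"           if cell( $row, 1 ) & $bits{C};
```

Testing whether two cells are compatible then reduces to a bitwise AND of their state sets.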
> Why would an observed state be probabilistic? For exactly the "ambiguity"
> reasons you define above, and others (sequence trace/assembly quality
> scores, suspect mutations, etc.). This is particularly relevant for
> applications such as SNPs where you might want to remember the fraction of
> the population that has this vs. that allele at a particular position.
>
> So if we don't do this internal "translation", we don't need these hashes;
> for validation all we need is the alphabet/symbols method (specified
> earlier).
It'd definitely be nice if metadata could be attached to each individual
cell, but I'm just worried how this would work out memory-wise. I tried
out character objects for bio::phylo, and it was really slow. In any
case, though, that's more an issue for the charseq interface, not for
the matrix that contains it.
>> (Mesquite and paup do things internally like this as well, albeit with
>> some multidimensional array jiggery-pokery.)
>
> yep, I think that jiggery-pokery may be in our game-plan as well.
>
> Of course, this may be a schism point between a Bio::CDAT::MatrixI and a
> Bio::CDAT::ProbabilisticMatrixI (which ISA Bio::CDAT::MatrixI), which
> would also be fine.
Sounds good.
>>> Thus, we'll also need a way to disassociate a matrix from its CDAT object to "restore" its normal base functionality.
>
> With the Observer pattern, this is simply a de-registration, no need to
> un/re-bless.
Could you say more about how this would work? $matrix->register(
$cdat ); after which point the cdat gets notified about any change in
the matrix so that it can cascade changes in other objects?
Date: Tue, 18 Jul 2006 09:42:51 -0400, From: aaron.j.mackey@GSK.com
Date: Tue, 18 Jul 2006 09:42:51 -0400
From: aaron.j.mackey@GSK.com
Subject: Re: character state matrix api
> Individual cells would have to be pretty compressed one way or another.
> The big idea here was that this is done by mapping all ambiguity onto
> single character symbols. Another way that I know of is by packing the
> states into bitvectors, e.g. A: 1000, C: 0100, G: 0010, T: 0001, N:
> 1111, (which may or may not be surprisingly efficient depending on how
> perl rounds bytes). Whatever it is, I don't see every "cell" become a
> complex data structure, not to mention an object. It'll be impossible in
> terms of memory requirements. A matrix for 200 taxa, 2000 characters?
> It'll take ages to parse.
This is exactly what the Flyweight pattern is for. The classic example is
a word processor application that "somehow" has to keep track of the
independent state characteristics (font, size, color, embellishment) of
each and every character in a document, possibly tens of thousands.
>> With the Observer pattern, this is simply a de-registration, no need to
>> un/re-bless.
>
> Could you say more about how this would work? $matrix->register(
> $cdat ); after which point the cdat gets notified about any change in
> the matrix so that it can cascade changes in other objects?
Here's one way to do it which maintains the original $matrix object, but
"wraps" it with a CDAT-savvy matrix object:
$cdat->add_matrix(\$matrix) entails:
    sub add_matrix {
        my ($self, $matrix) = @_;
        # replace $matrix with a CDAT-compatible matrix object that
        # delegates to original $matrix
        $$matrix = Bio::CDAT::Component::Matrix->new($$matrix);
        # register the CDAT object itself (i.e. $self) as a listener
        $$matrix->register($self);
        push @{$self->{_matrices}}, $$matrix;
        return $$matrix;
    }
Now when I call $matrix->delete_column(10), I'm really calling
Bio::CDAT::Component::Matrix::delete_column(), which notifies its
listeners, and delegates to the original $matrix's delete_column() method.
$cdat->remove_matrix(\$matrix) would do the opposite (replacing $$matrix
with the original object), as might $matrix->unregister($cdat);
There are other ways to do it that don't involve delegation (e.g.
destructively convert the original matrix object into a CDAT-savvy object,
or "decorate" the original object with CDAT-savvy methods via adding to
the object's @ISA), but each has its pros and cons.
-Aaron
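The delegation variant can be tried out end to end with the self-contained sketch below. It is only an illustration under assumptions: My::PlainMatrix, My::WrappedMatrix and My::CDAT are hypothetical stand-ins (the real Bio::CDAT::Component::Matrix is not defined anywhere in this thread), and AUTOLOAD is just one possible way of forwarding the calls that the wrapper does not intercept.

```perl
use strict;
use warnings;

package My::PlainMatrix;    # hypothetical stand-in for any existing matrix class
sub new { my ( $class, @cols ) = @_; return bless { cols => [@cols] }, $class }
sub delete_column { my ( $self, $i ) = @_; splice @{ $self->{cols} }, $i, 1; return $self }
sub ncols { return scalar @{ $_[0]{cols} } }

package My::WrappedMatrix;    # hypothetical CDAT-savvy wrapper
our $AUTOLOAD;

sub new {
    my ( $class, $inner, $cdat ) = @_;
    return bless { inner => $inner, cdat => $cdat }, $class;
}
sub inner { return $_[0]{inner} }

# Intercept the mutating call, notify the CDAT, then delegate.
sub delete_column {
    my ( $self, $i ) = @_;
    $self->{cdat}->update( 'delete_column', $i );
    $self->{inner}->delete_column($i);
    return $self;
}

# Every other method is passed straight through to the wrapped object.
sub AUTOLOAD {
    my $self = shift;
    ( my $method = $AUTOLOAD ) =~ s/.*:://;
    return if $method eq 'DESTROY';
    return $self->{inner}->$method(@_);
}

package My::CDAT;    # hypothetical
sub new { my $class = shift; return bless { matrices => [], events => [] }, $class }
sub update { my ( $self, @event ) = @_; push @{ $self->{events} }, "@event"; return }

# Takes a reference to the caller's variable so it can swap in the wrapper.
sub add_matrix {
    my ( $self, $matrix_ref ) = @_;
    $$matrix_ref = My::WrappedMatrix->new( $$matrix_ref, $self );
    push @{ $self->{matrices} }, $$matrix_ref;
    return $$matrix_ref;
}

# The opposite: hand the caller back the original, unwrapped object.
sub remove_matrix {
    my ( $self, $matrix_ref ) = @_;
    my $wrapper = $$matrix_ref;
    @{ $self->{matrices} } = grep { $_ != $wrapper } @{ $self->{matrices} };
    $$matrix_ref = $wrapper->inner;
    return $$matrix_ref;
}

package main;

my $cdat   = My::CDAT->new;
my $matrix = My::PlainMatrix->new(qw(c1 c2 c3));
$cdat->add_matrix( \$matrix );
$matrix->delete_column(1);    # goes through the wrapper; $cdat is notified
$cdat->remove_matrix( \$matrix );
$matrix->delete_column(0);    # plain object again; nobody is notified
print ref($matrix), " with ", $matrix->ncols, " column(s); events: @{ $cdat->{events} }\n";
```

One cost of this design is visible in the sketch: any other variable that still points at the original matrix bypasses the wrapper, which is one of the "cons" of delegation compared with reblessing.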
Date: Tue, 18 Jul 2006 13:19:14 -0700, From: Rutger Vos <rvosa@sfu.ca>
Date: Tue, 18 Jul 2006 13:19:14 -0700
From: Rutger Vos <rvosa@sfu.ca>
Subject: Re: character state matrix api
You are the Gang of Four and I claim my $5.
But seriously, so would the matrix contain row objects which in turn
contain flyweight cell objects? Who is to know what classes of flyweight
objects are to be instantiated? Does the matrix decide? The matrix row?
I'm curious to hear how you'd see this be organized architecturally.
Best wishes,
Rutger
Date: Wed, 19 Jul 2006 08:57:06 -0400, From: aaron.j.mackey@gsk.com
Date: Wed, 19 Jul 2006 08:57:06 -0400
From: aaron.j.mackey@gsk.com
Subject: Re: character state matrix api
Perhaps it's time to have another little conference to discuss these ideas
more fully? I can setup a webEx teleconference (so at least we'd have
"whiteboard"-like capability) for the core group to discuss basic
implementation details.
One realization I had today about this discussion is that (at least in my
mind) we've been discussing two separate things: CDAT "native" matrix
representation/implementation vs. the CDAT matrix API. Mixed in there has
been ideas about object "IO" (e.g. $cdat->add_matrix($matrix), where
$matrix is not a file to be parsed, but some Bio::Align::AlignI-like
object), and the possibility of being able to achieve the CDAT matrix API
without reconstructing a new CDAT matrix object, but by "decorating" the
original $matrix object with the CDAT API (and on further thought, I'm not
sure that will be possible in the long run).
Regardless, let's have a "voice-to-voice" sometime soon. I'm available
all day Thursday and most of the day Friday. Let me know your
availability for the next, say, 7 working days (through July 28th).
Thanks,
-Aaron
Date: Wed, 19 Jul 2006 18:17:56 -0700, From: Rutger Vos <rvosa@sfu.ca>
Date: Wed, 19 Jul 2006 18:17:56 -0700
From: Rutger Vos <rvosa@sfu.ca>
Subject: Re: character state matrix api
In recent discussions we have had some confusion about what I mean by
"taxon". This page describes more or less what I meant (and, hopefully, why
we'd need something like that for CDAT):
http://mesquiteproject.org/Mesquite_Folder/docs/mesquite/Taxa.html
Subject: Bio::Align::AlignI not suited, right?
Date: Sun, 16 Jul 2006 00:09:58 -0700, From: Rutger Vos <rvosa@sfu.ca>
Date: Sun, 16 Jul 2006 00:09:58 -0700
From: Rutger Vos <rvosa@sfu.ca>
Subject: Bio::Align::AlignI not suited, right?
Hi all,
I want to verify with you whether you think Bio::Align::AlignI is
suitable as an interface for character state matrices. I don't think it
is, as it's too specifically dna-oriented. It's a shame it inherits
directly from the root, ideally it would be a subclass of a character
data matrix. Can you see CharMatrixI essentially be a subset of
Bio::Align::AlignI, so that in the future maybe we can convince bioperl
to make Bio::Align::AlignI inherit from it (so that alignments can be
used by cdat directly)?
Rutger
Date: Sun, 16 Jul 2006 19:20:50 -0400, From: aaron.j.mackey@gsk.com
Date: Sun, 16 Jul 2006 19:20:50 -0400
From: aaron.j.mackey@gsk.com
Subject: Re: Bio::Align::AlignI not suited, right?
Yep, I think I can agree with all of that, except the very last bit about
CDAT using AlignI's directly - I'm content for an AlignI to not be
immediately CDAT-usable, just as a Bio::SeqI won't ever be immediately
CDAT-usable. I think we're always going to have to reinterpret these more
basic objects in the context of the CDAT data model, bestowing various
functionalities not inherent to the native object.
The first step to making something happen in BioPerl is to check out the
CVS code, make the change (alter AlignI's ISA to include CharMatrixI), and
then see what breaks in the test suite (of course, there are no tests,
yet, to ensure that a particular implementation class of AlignI fully
implements all methods defined in CharMatrixI).
-Aaron
Subject: Character flyweight sketch
Date: Mon, 24 Jul 2006 16:27:47 -0700, From: Rutger Vos <rvosa@sfu.ca>
Date: Mon, 24 Jul 2006 16:27:47 -0700
From: Rutger Vos <rvosa@sfu.ca>
Subject: Character flyweight sketch
Hi all,
attached is a sketch for a flyweight character class, my interpretation
of what Aaron meant a few days ago.
Rutger
Attachment: Character.pm
package Bio::Phylo::Matrices::Character;
use strict;
use warnings;
use constant { CHAR => 0, AMBIG => 1, RCASE => 2, POLY => 3 };
use Bio::Phylo::Util::CONSTANT qw(_DATUM_);
use Bio::Phylo::Util::IDPool;
# Shared by all instances: $cache maps a state key to its flyweight object,
# $chars holds the actual state records, indexed by the id in each object.
my $cache = {};
my $chars = [];
sub new {
    my ( $class, @args ) = @_;
    my ( $respectcase, $is_poly ) = ( 0, 0 );
    my ( %opt, $char, $ambig );
    # named arguments: an even-sized list of more than two elements
    if ( not scalar @args % 2 and scalar @args > 2 ) {
        %opt         = @args;
        $is_poly     = $opt{'-polymorphism'} ? $opt{'-polymorphism'} : 0;
        $respectcase = $opt{'-respectcase'}  ? $opt{'-respectcase'}  : 0;
        $char        = $opt{'-char'};
        $ambig       = $opt{'-ambig'};
    }
    # positional arguments: ( $char, $ambig ) or just ( $char )
    elsif ( scalar @args == 2 ) {
        ( $char, $ambig ) = ( $args[0], $args[1] );
    }
    elsif ( scalar @args == 1 ) {
        $char = $args[0];
    }
    # the lookup is always an array reference; by default a character
    # is only ambiguous with (the upper case form of) itself
    $ambig = [ defined $ambig ? $ambig : uc($char) ] unless ref($ambig) eq 'ARRAY';
    # parentheses keep sort from swallowing the two flags that follow
    my $key = join '', $char, ( sort @$ambig ), $respectcase, $is_poly;
    if ( $cache->{$key} ) {
        return $cache->{$key};
    }
    else {
        my $self = Bio::Phylo::Util::IDPool->_initialize();
        $chars->[ $$self ] = [
            $char,
            $ambig,
            $respectcase,
            $is_poly,
        ];
        bless $self, $class;
        $cache->{$key} = $self;
        return $self;
    }
}
# The setters do not modify the (shared) object; each returns the
# flyweight that represents the requested combination of states.
sub set_char {
    my ( $self, $char ) = @_;
    $self = __PACKAGE__->new(
        '-char'         => $char,
        '-polymorphism' => $self->is_polymorphic(),
        '-respectcase'  => $self->is_case_sensitive(),
        '-ambig'        => $self->get_ambig_lookup(),
    );
    return $self;
}
sub set_ambig_lookup {
    my ( $self, $ambig ) = @_;
    $self = __PACKAGE__->new(
        '-char'         => $self->get_char(),
        '-polymorphism' => $self->is_polymorphic(),
        '-respectcase'  => $self->is_case_sensitive(),
        '-ambig'        => $ambig,
    );
    return $self;
}
sub set_case_sensitivity {
    my ( $self, $cs ) = @_;
    $self = __PACKAGE__->new(
        '-char'         => $self->get_char(),
        '-polymorphism' => $self->is_polymorphic(),
        '-respectcase'  => $cs,
        '-ambig'        => $self->get_ambig_lookup(),
    );
    return $self;
}
sub set_polymorphism {
    my ( $self, $poly ) = @_;
    $self = __PACKAGE__->new(
        '-char'         => $self->get_char(),
        '-polymorphism' => $poly,
        '-respectcase'  => $self->is_case_sensitive(),
        '-ambig'        => $self->get_ambig_lookup(),
    );
    return $self;
}
# ${ $_[0] } dereferences the invocant; a bare $$_[0] would mean $_->[0]
sub get_char          { $chars->[ ${ $_[0] } ]->[ CHAR ] }
sub get_ambig_lookup  { $chars->[ ${ $_[0] } ]->[ AMBIG ] }
sub is_case_sensitive { $chars->[ ${ $_[0] } ]->[ RCASE ] }
sub is_polymorphic    { $chars->[ ${ $_[0] } ]->[ POLY ] }
sub _container { _DATUM_ }
################################################################################
package main;
use Data::Dumper;
my @dna = qw(A C G T);
my @array;
push @array, Bio::Phylo::Matrices::Character->new($dna[int(rand(scalar @dna))]) for ( 0 .. 100000 );
print Dumper( \@array );
Date: Tue, 25 Jul 2006 09:36:18 -0400, From: aaron.j.mackey@gsk.com
Date: Tue, 25 Jul 2006 09:36:18 -0400
From: aaron.j.mackey@gsk.com
Subject: Re: Character flyweight sketch
Yeah, without quibbling over what the various get/set methods are (or what
might be missing), this is the general idea - keep a cache of very
lightweight objects that all share a common underlying (possibly large)
datastructure.
What you're missing (and usually isn't discussed in the Flyweight pattern
documentation) is a bulk constructor: i.e. if you were actually parsing an
entire matrix and wanted to "preload" $chars in one bulk import. Of
course, this would involve changing your identification pattern to one
more rooted in the structure (i.e. using (row, column) tuples as a
computable index into $chars) such that you don't also unnecessarily fill
the cache with all the blessed scalars.
But again, I thought we were going to wait to have an actual discussion
before plunging into so much coding? And that you remained unavailable
for such discussions until September?
-Aaron
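One possible reading of the bulk-constructor suggestion is sketched below. Everything in it is an assumption made for illustration: the class My::FlyweightMatrix is hypothetical and not part of Bio::Phylo, each state record is reduced to a single-key hash, and cells are stored as 8-bit ids, which limits the sketch to 256 distinct states. The point is that (row, column) becomes a computable index into one packed string, so no blessed scalar is created or cached per cell.

```perl
use strict;
use warnings;

package My::FlyweightMatrix;    # hypothetical; not part of Bio::Phylo

# Bulk constructor: takes a list of equal-length strings, one per row.
sub new {
    my ( $class, @rows ) = @_;
    my $ncols = length( $rows[0] );
    my ( %id_of, @states );
    my $cells = '';
    my $i     = 0;
    for my $row (@rows) {
        die "ragged matrix\n" if length($row) != $ncols;
        for my $symbol ( split //, $row ) {
            # Each distinct state is stored once; cells hold only its id.
            if ( !exists $id_of{$symbol} ) {
                push @states, { char => $symbol };
                $id_of{$symbol} = $#states;
            }
            vec( $cells, $i++, 8 ) = $id_of{$symbol};
        }
    }
    return bless { ncols => $ncols, cells => $cells, states => \@states }, $class;
}

# (row, column) is turned into an offset; the shared record is returned.
sub cell {
    my ( $self, $row, $col ) = @_;
    my $id = vec( $self->{cells}, $row * $self->{ncols} + $col, 8 );
    return $self->{states}[$id];
}

sub n_states { return scalar @{ $_[0]{states} } }

package main;

my $m = My::FlyweightMatrix->new( 'ACGT', 'ACGA', 'ANGT' );
print $m->cell( 2, 1 )->{char}, "\n";    # state at row 2, column 1
print $m->n_states, " shared state records for 12 cells\n";
```

Richer state records (ambiguity lookup, case sensitivity, polymorphism) would slot into the shared @states table without changing the per-cell cost.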
From: stoltzfu@umbi.umd.edu
Subject: Re: Character flyweight sketch
Date: July 26, 2006 12:39:00 PM EDT
On Jul 25, 2006, at 9:36 AM, aaron.j.mackey@gsk.com wrote:
> But again, I thought we were going to wait to have an actual discussion
> before plunging into so much coding? And that you remained unavailable
> for such discussions until September?
I am having mixed feelings about where we are going here. On the one hand,
it rarely hurts to actually DO something, and obviously Rutger is getting things
done and in one sense, we don't want to slow him down by trying to get a group
consensus on everything. It's possible that he is going to solve all of our problems
while we stand on the sidelines watching. On the other hand, we don't want to
develop too much without a plan, and our priorities also include:
1. developing specific use cases to serve as target problems
2. developing a spec for CDAT-BioPerl integration
3. making plans for our next meeting.
My opinion is that if people want to start coding on an individual basis, that's
great, but we should think of this as an experimental branch to be tested against
a spec that is not yet fully developed.
With respect to #3, I will send a separate message today. I am working on the
kinases case for #1. More later,
Arlin
Date: Wed, 26 Jul 2006 12:52:54 -0400, From: aaron.j.mackey@gsk.com
Date: Wed, 26 Jul 2006 12:52:54 -0400
From: aaron.j.mackey@gsk.com
Subject: Re: Character flyweight sketch
Yes, I agree entirely. Rutger, please forgive my earlier message for
sounding far too stodgy. As I've mentioned before, I truly appreciate
your enthusiasm (both for discussion and actual coding). My (small)
concern is ending up in a situation where we feel "locked in" to a
particular implementation because of a significant effort that went into
it. But in this "prototyping" stage, code does have the advantage (over
"blather") of being concrete and testable.
-Aaron
Date: Wed, 26 Jul 2006 09:59:09 -0700, From: Rutger Vos <rvosa@sfu.ca>
Date: Wed, 26 Jul 2006 09:59:09 -0700
From: Rutger Vos <rvosa@sfu.ca>
Subject: Re: Character flyweight sketch
Hi all,
sorry for stirring up a panic with that code sample.
The way I think about programming problems is sometimes fairly
bottom-up, so playing around with smaller components in the system helps
me get an idea as to how they might sensibly interact and what the
higher level architecture might be like.
Please don't read too much into the ideas I am throwing at you right
now, I'm thinking out loud.
By the way, on the topic of use cases, have we nominated NEXPL's set of
nexus test files for the test suite yet? That looks like a great acid
test for the IO system.
Rutger
cdat prototyping
next message
CDAT design consideration: Mediator pattern
next message
next topic
next message
Post mortem
This section is an attempt to condense the preceding. (Please step in here to change things, but the goal is to keep things in this section on a "need to know" basis. A one page "executive summary". -- RutgerVos)
Requirements
The requirements for a character matrix object that emerge from the preceding
discussion include:
- BioPerl compatible: the matrix may be composed of BioPerl utility objects, and the overall CoreEIG architecture will consume BioPerl objects. As BioPerl has no useful character matrix interface, we design one here following BioPerl's interface style.
- Bio::CDAT compatible: there will be a link between the cdat object and the matrix. This connection maintains referential integrity between biological data objects involved in an analysis.
- Data integrity for sophisticated types: the matrix will be heavily annotatable, will contain mixed data types, will have a mechanism to define probabilistic character states, true polymorphism, and uncertainty between categorical states.
- Multiple IO types: the matrix will be populated from a variety of data sources, such as flat files, database streams, webservice/corba connections.
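The "probabilistic character states" requirement can be illustrated with a small sketch. The functions make_cell, certain and best_state are hypothetical and no API was agreed in the discussion; the sketch merely assumes that a cell is a hash of state to probability over a fixed alphabet, with an ordinary observed state as the special case that puts all probability on one symbol.

```perl
use strict;
use warnings;

my @alphabet = qw(A C G T);

# Build a probabilistic cell: a hash of state => probability.
sub make_cell {
    my (%p) = @_;
    my %known = map { ( $_ => 1 ) } @alphabet;
    my $total = 0;
    for my $state ( keys %p ) {
        die "unknown state '$state'\n" unless $known{$state};
        die "negative probability for '$state'\n" if $p{$state} < 0;
        $total += $p{$state};
    }
    die "probabilities sum to $total, not 1\n" if abs( $total - 1 ) > 1e-9;
    return \%p;
}

# A certain observation puts all probability on a single symbol.
sub certain { return make_cell( $_[0], 1 ) }

# Most probable state of a cell (ties broken alphabetically).
sub best_state {
    my ($cell) = @_;
    my ($best) = sort { $cell->{$b} <=> $cell->{$a} or $a cmp $b } keys %$cell;
    return $best;
}

my $snp = make_cell( A => 0.75, G => 0.25 );    # allele frequencies at a SNP
print best_state($snp), "\n";
print best_state( certain('C') ), "\n";
```

A hash per cell is exactly the memory cost that worried Rutger above, so in practice such records would presumably be shared, for example through the Flyweight approach.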
Interface design
The discussion also touched on the API design. The consensus was that the interface should (at least) follow
BioPerl's interface style, or probably be more explicit about the distinction between "getters" and "setters".
Implementation
Several implementation details were discussed:
- A way to specify correspondence between ambiguous data types through a lookup hash.
- A callback system through which the cdat object can maintain data integrity, akin to the Observer pattern.
- A Flyweight pattern implementation for characters in a matrix.
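As an illustration of the first item, a lookup hash for ambiguous symbols could look like the sketch below. It uses the standard IUPAC nucleotide ambiguity codes; the function names is_valid and compatible are hypothetical.

```perl
use strict;
use warnings;

# IUPAC nucleotide ambiguity codes mapped to the states they stand for.
my %lookup = (
    A => ['A'], C => ['C'], G => ['G'], T => ['T'],
    R => [qw(A G)], Y => [qw(C T)], S => [qw(C G)], W => [qw(A T)],
    K => [qw(G T)], M => [qw(A C)],
    B => [qw(C G T)], D => [qw(A G T)], H => [qw(A C T)], V => [qw(A C G)],
    N => [qw(A C G T)],
);

# Validation needs nothing more than the set of known symbols.
sub is_valid { return exists $lookup{ uc $_[0] } }

# Two symbols are compatible if their state sets share a member.
sub compatible {
    my ( $x, $y ) = map { uc $_ } @_;
    my %seen = map { ( $_ => 1 ) } @{ $lookup{$x} };
    return scalar grep { $seen{$_} } @{ $lookup{$y} };
}

print "R and N overlap\n"        if compatible( 'R', 'N' );
print "R and Y do not overlap\n" if !compatible( 'R', 'Y' );
print "X is not a symbol\n"      if !is_valid('X');
```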
-- ArlinStoltzfus - 28 Aug 2006
-- RutgerVos - 30 Aug 2006