Building logically consistent taxonomies

In observational sciences like geology, descriptions of the entities of scientific interest are commonly constructed using properties that have categorical (terminological) values. For example, a rock description will likely use terms such as fine-grained, coarse-grained, well-rounded, euhedral, porphyritic, or granoblastic, as well as mineral names like quartz, plagioclase, biotite, and calcite, to name but a few. For software applications to work with such data, datasets must use a vocabulary in which string labels or identifiers for concepts or categories are used consistently and are understood by the software. In information system terms, this understanding (semantics) means that the software follows rules for inferencing that are consistent with the scientific meaning and logic of the represented concept.

In order for data that utilize categorical (terminological) values for properties to be exchanged and utilized by multiple software applications (interoperability), either datasets must use the same vocabulary, understood by all the applications, or there must be a mapping between terms in the vocabularies used in the datasets so each application can work with its own vocabulary.

An important aspect of many of the systems of terms (concept schemes) used to describe natural phenomena is a hierarchy from broader to narrower concepts. For example, in rock classification, we have igneous rock > plutonic rock > granitoid > diorite > quartz diorite. In such a hierarchy, each broader concept includes multiple more specific concepts. A challenge constantly faced in the construction and presentation of these concept schemes is that there are typically multiple ways to arrange a concept hierarchy, because a given concept can have multiple broader categories. For example, the mineral galena is both a sulfide mineral and a cubic mineral; equally valid hierarchies of mineral species can be constructed by grouping on chemistry first and then on crystallography, or vice versa.

Another aspect of the categories in a concept scheme is that they are often defined based on multiple properties. A given mineral species name (category, term) implies information about several physical properties like color and hardness, optical characteristics, chemical composition and crystal structure. A rock name has implications about mineralogy, kind of grains (clastic, crystalline, fossil), grain size, fabric, as well as physical properties like density and magnetic susceptibility. A sedimentary environment category has implications that might include climate, water depth, depositional processes, or relationships to water bodies or mountains.

The ACE software package is designed to build vocabularies for use in artificial intelligence applications. In particular, we are interested in a matching problem: given a description of the geology and other relevant parameters for some area, and a set of models for predicting some phenomenon of interest, what is the likelihood that a particular phenomenon will be observed or, alternatively, which model most likely corresponds to the observed description? Practical applications include identifying mineral deposit targets based on a collection of deposit models and geologic map data, and predicting the likelihood of a landslide event based on a collection of landslide triggering models, geologic map data, and real-time climatic or seismic events.

In order to address these problems, we are utilizing an Aristotelian approach to definitions of terms in vocabularies used to construct geologic descriptions. Concepts are defined through binding to a more general concept (genus) and using a consistent set of differentiating properties (differentia) to distinguish concepts with the same genus. Property value vocabularies, used to specify the differentia, are constructed in the same fashion. Property values might be complex, i.e. having their own set of differentia, or atomic, i.e. defined using a single differentiating property axis. The atomic property values are simpler and less problematic to define in an unambiguous way. The asserted property values and their hierarchical relationships can then be used to infer hierarchical relationships between concepts they are used to define. The resulting vocabulary is easier to maintain and accounts for multiple parent (broader concept) relationships.
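The inference step described above can be illustrated with a minimal Python sketch. It is not the ACE implementation. The concept names, property names, the `BROADER_VALUE` table, and all function names are hypothetical and chosen only to mirror the galena example. Each concept is defined by the property values that differentiate it, and parent concepts are inferred rather than asserted.

```python
# Broader-value links for atomic property values (illustrative assumption).
BROADER_VALUE = {"lead sulfide": "sulfide"}

# Each concept is defined only by its differentiating property values.
# No parent-child links between concepts are asserted here.
DEFINITIONS = {
    "mineral": {},
    "sulfide mineral": {"chemistry": "sulfide"},
    "cubic mineral": {"crystal system": "cubic"},
    "galena": {"chemistry": "lead sulfide", "crystal system": "cubic"},
}


def value_is_kind_of(value, broader):
    """True if value equals broader or lies beneath it in the value hierarchy."""
    while value is not None:
        if value == broader:
            return True
        value = BROADER_VALUE.get(value)
    return False


def subsumes(general, specific):
    """True if every restriction on `general` is met or narrowed by `specific`."""
    g, s = DEFINITIONS[general], DEFINITIONS[specific]
    return all(p in s and value_is_kind_of(s[p], v) for p, v in g.items())


def strictly_subsumes(general, specific):
    """True if `general` subsumes `specific` but not the other way round."""
    return subsumes(general, specific) and not subsumes(specific, general)


def inferred_parents(concept):
    """Most specific concepts that strictly subsume `concept`."""
    ancestors = [c for c in DEFINITIONS if strictly_subsumes(c, concept)]
    # Drop any ancestor that is broader than another ancestor.
    return sorted(a for a in ancestors
                  if not any(strictly_subsumes(a, b) for b in ancestors))


print(inferred_parents("galena"))
```

In this sketch galena acquires two inferred parents, one from its chemistry and one from its crystal system, without either relationship being stated directly.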

Construction of a vocabulary using this approach allows testing for logical problems in the definitions. For example, errors in specification of differentiating properties might result in inferring parent-child relationships that do not make scientific sense, or in having orphan concepts that do not have expected 'parent' concepts. Because each vocabulary term is linked to the properties that define it, different hierarchical views can be constructed by grouping on properties in different order, appropriate to different applications.
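As a rough illustration of building different hierarchical views from the same definitions, the following Python sketch groups a small set of minerals on their defining properties in a chosen order. The `MINERALS` table and the `build_view` function are hypothetical names introduced for this example. The property values shown are a simplified assumption, not a complete mineral description.

```python
# Defining properties for a few mineral terms (simplified, illustrative).
MINERALS = {
    "galena": {"chemistry": "sulfide", "crystal system": "cubic"},
    "pyrite": {"chemistry": "sulfide", "crystal system": "cubic"},
    "halite": {"chemistry": "halide", "crystal system": "cubic"},
    "calcite": {"chemistry": "carbonate", "crystal system": "trigonal"},
}


def build_view(terms, property_order):
    """Nest terms by the given properties, in order.

    Returns nested dicts keyed by property value; the innermost
    level is a sorted list of term names.
    """
    if not property_order:
        return sorted(terms)
    first, rest = property_order[0], property_order[1:]
    groups = {}
    for name, props in terms.items():
        groups.setdefault(props[first], {})[name] = props
    return {value: build_view(members, rest)
            for value, members in sorted(groups.items())}


# Same terms, two hierarchies: chemistry first, or crystallography first.
by_chemistry = build_view(MINERALS, ["chemistry", "crystal system"])
by_structure = build_view(MINERALS, ["crystal system", "chemistry"])
print(by_chemistry)
print(by_structure)
```

Because the hierarchy is derived from the property values rather than stored, changing the order of the property list is enough to produce a view suited to a different application.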

What do we mean by 'taxonomy'?

A taxonomy is a hierarchy of concepts. In this discussion, we use the term taxonomy to mean a hierarchical system of concepts in which the child concepts all have a 'kind of' relationship to the parent concept. In such a hierarchy, any instance of a child concept is also an instance of its parent concept. Other kinds of hierarchies can be constructed using different relationships like 'part of', 'contains', 'derived from', or 'biological child of', but these are not taxonomies.

A taxonomy has a top concept that is the most general kind of entity included in the system. The children of a specific concept in the taxonomy are referred to as siblings, following the family metaphor that is typically used. Many taxonomies have a simple tree structure, in which each concept (except the top concept) has exactly one parent. A directed acyclic graph is a more general hierarchical structure in which a concept might have multiple parent concepts. A real-world taxonomy with a tree structure is the biological classification of living organisms in the Linnaean taxonomy.
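The difference between a tree and a directed acyclic graph can be made concrete with a small Python sketch. The `PARENTS` table and both function names are hypothetical. The sketch assumes the 'kind of' links contain no cycles and does not verify that.

```python
# 'Kind of' links: each concept maps to its direct parent concepts.
# Illustrative data only; the top concept has an empty parent list.
PARENTS = {
    "mineral": [],
    "sulfide mineral": ["mineral"],
    "cubic mineral": ["mineral"],
    "galena": ["sulfide mineral", "cubic mineral"],
}


def ancestors(concept, parents):
    """All broader concepts reachable by following 'kind of' links upward."""
    found = set()
    stack = list(parents[concept])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


def is_tree(parents):
    """True if there is one top concept and every other concept has one parent.

    Assumes the links are acyclic; cycles are not detected here.
    """
    tops = [c for c, p in parents.items() if not p]
    others = [p for p in parents.values() if p]
    return len(tops) == 1 and all(len(p) == 1 for p in others)


print(sorted(ancestors("galena", PARENTS)))
print(is_tree(PARENTS))
```

Any instance of galena is also an instance of every concept returned by `ancestors`, which is the 'kind of' property that distinguishes a taxonomy from other hierarchies. Because galena has two parents here, the structure is a directed acyclic graph rather than a tree.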

What is an 'Aristotelian Taxonomy'?

An Aristotelian definition [Smith, 2003b] of concept A is of the form "An A is a B such that C", where B is a more general concept than A and C is a condition that defines how A is differentiated amongst the sub-concepts of B. Aristotle [350 B.C.] called the general concept B the genus and the condition C the differentia [Poole and Smyth, 2011].

Aristotle anticipated many of the issues that arise in definitions:

If genera are different and co-ordinate, their differentiae are themselves different in kind. Take as an instance the genus "animal" and the genus "knowledge". "With feet", "two-footed", "winged", "aquatic", are differentiae of "animal"; the species of knowledge are not distinguished by the same differentiae. One species of knowledge does not differ from another in being "two-footed".

Note that "co-ordinate" here means neither is subordinate to the other. Genera is the plural of genus.

The scope of the taxonomy is determined by the definition of the top concept in the hierarchy, and by the set of properties used to differentiate sibling concepts beneath each parent concept. Child concepts are defined by restricting the range of one or more property values on the parent genus. Note that the differentia for child concepts beneath different parents might differ, but the same set of properties must be used to differentiate all siblings beneath a given parent.
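The two rules in this paragraph lend themselves to an automated check. The Python sketch below is one possible form of such a check, under stated assumptions: property ranges are modeled as numeric (low, high) pairs, and the concept names, the `grain_diameter_mm` property, its values, and the `check_children` function are all hypothetical and illustrative.

```python
# Each concept names its genus and the property ranges that define it.
# Ranges are (low, high) pairs; all values are illustrative assumptions.
CONCEPTS = {
    "sand": {"genus": None, "ranges": {"grain_diameter_mm": (0.0625, 2.0)}},
    "fine sand": {"genus": "sand", "ranges": {"grain_diameter_mm": (0.125, 0.25)}},
    "coarse sand": {"genus": "sand", "ranges": {"grain_diameter_mm": (0.5, 1.0)}},
}


def check_children(genus, concepts):
    """Return a list of problems found among the direct children of `genus`."""
    problems = []
    children = {n: c for n, c in concepts.items() if c["genus"] == genus}
    parent_ranges = concepts[genus]["ranges"]

    # Rule 1: siblings must be differentiated on the same set of properties.
    property_sets = {frozenset(c["ranges"]) for c in children.values()}
    if len(property_sets) > 1:
        problems.append(
            f"children of {genus!r} use different differentiating properties")

    # Rule 2: a child may only restrict, never widen, a range set on the genus.
    for name, child in sorted(children.items()):
        for prop, (low, high) in child["ranges"].items():
            if prop in parent_ranges:
                p_low, p_high = parent_ranges[prop]
                if low < p_low or high > p_high:
                    problems.append(f"{name!r} widens {prop!r} beyond {genus!r}")
    return problems


print(check_children("sand", CONCEPTS))
```

An empty list means that the siblings share one set of differentiating properties and that each child stays within the range of its genus. A property that the genus leaves unrestricted is treated as open, so a child may restrict it freely.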