Ontology Development Notes
Rick Hooper, September 11, 2008, version 2.0
Purpose and Scope. This ontology is designed to enable users to discover time series data collected at fixed points. It may ultimately have a broader application, but design decisions should be made to optimize it for this use and this class of data.
One implication of this purpose and scope is a clear distinction between data and metadata. The data are the time series collected at a fixed point. Metadata describe these data. Therefore, groundwater level is data; the aquifer in which the well is screened and the permeability of the porous media of that aquifer are metadata. Sap flow through a tree is data; the species of the tree is metadata.
Logic of Ontology.
There are two primary factors required to describe time series data collected at a fixed point: the environment in which it is measured (e.g., land, ocean, stream, lake), and the media on which it is measured (e.g., water, sediment, tissue). The data themselves can be classified as physical, chemical, or biological data, and there are numerous levels below these primary groupings for grouping similar data.
The logic of this ontology has to be evaluated from two perspectives: that of the data publisher trying to figure out where to tag her data and that of the data user trying to discover what data are available. For the data publisher, this logic should provide an easy way to navigate to a leaf concept to which the data should be tagged. For the data user, the ontology should provide a rich set of synonyms and logical groupings that will allow a non-expert user to home in on the data that they really want. An important advantage of ontologies is that this classification is not a strict either-or hierarchy. For example, dissolved silica data tagged by an ecologist as a micronutrient will be discovered by a geochemist looking for weathering products if the ontology is constructed to recognize that dissolved silica belongs to both groupings.
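The dissolved silica example can be sketched in a few lines of Python. This is only an illustration of a concept with more than one parent grouping, not a proposed implementation; the concept names, the PARENTS table, and the series names are all made up for the example.

```python
# Hypothetical fragment of the ontology: each concept lists its parent groupings.
# A concept may have more than one parent, so this is a graph, not a strict tree.
PARENTS = {
    "dissolved silica": ["micronutrients", "weathering products"],
    "nitrate": ["nutrients"],
    "micronutrients": ["nutrients"],
    "nutrients": ["chemical"],
    "weathering products": ["chemical"],
}


def ancestors(concept):
    """Return every grouping reachable by following parent links upward."""
    seen = set()
    stack = [concept]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def discover(grouping, tagged_series):
    """Return names of series tagged to the grouping or to anything beneath it."""
    return [name for name, concept in tagged_series
            if concept == grouping or grouping in ancestors(concept)]


# Hypothetical tagged series: (series name, leaf concept chosen by the publisher).
series = [("Site A Si", "dissolved silica"), ("Site B NO3", "nitrate")]

print(discover("weathering products", series))  # the geochemist's search
print(discover("micronutrients", series))       # the ecologist's search
```

Both searches return the same silica series even though the publisher tagged it only once, at the leaf concept.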
The initial box of the ontology from which all the hierarchy springs can be considered the “core concept.” What is this ontology describing? “Hydrosphere” is a recognized term in the global change dictionary (citation?), but I believe our ontology is broader than that. I would like to call it a “critical zone” ontology and have it reviewed by our colleagues in geochemistry and geomorphology. My sense is that it would capture fixed-location time series data for these disciplines.
Ontology Layers of Environment/Media.
The environment is expressed in two layers, environment and component. The environment is meant to be an exhaustive, mutually exclusive listing of possible environments. The initial list of environments is:
1. Oceans (synonyms: estuaries, bays; essential character: salty water)
2. Rivers (synonyms: streams, brooks; essential character: lotic (running) freshwater)
3. Lakes (synonyms: reservoirs, ponds; essential character: lentic (still) freshwater)
4. Land
5. Infrastructure (human-built water conveyances; also data about instrumentation, such as data logger voltage)
6. Atmosphere (overlying all of the first 4 classes; rationale: transport in atmosphere dominates transport between atmosphere and what it overlies)
7. Instrumentation (time series, such as battery voltage, recorded about instrumentation)
Note that I have dropped “Cryosphere” as an environment. Glaciers are frozen rivers; snow/ice fields are permanent snow/ice features that do not move. Hence, glaciers now appear in the river environment and snow/ice fields are in the Land environment.
The first 5 Environments are made up of multiple components, as shown in the following table:
Environment     Ice/Snow                         Water Column  Bed/Interface          Vegetation           Aquifer
Ocean           Ice Floes                        Water Column  Ocean Bed              Emergent Vegetation  Aquifer
River           Ice/Glacier                      Water Column  River Bed (hyporheic)  Emergent Vegetation  Aquifer
Lake            Ice                              Water Column  Lake bed               Emergent Vegetation  Aquifer
Land            Seasonal or Permanent Ice Field  --            Soil                   Vegetation           Aquifer
Infrastructure  Ice/Snow                         Water Column  Bed sediment           Emergent Vegetation  --
Metadata can be attached to the component level (e.g., soil horizon, aquifer permeability) and the environment level (e.g., Chesapeake Bay, Mississippi River, alpine cirque, lake geomorphometry).
Within each component, there are multiple media, as described in the following table:
Component     Gas                                  Rock/Solid Phase      Ice                      Water (liquid)      Non-Aqueous Phase Liquid  Tissue  Organism
Ice/Snow      Snow atmosphere or entrained in ice  Mineral Particulates  Ice                      Meltwater           NAPL                      Tissue  Organism
Water Column  Dissolved gases                      Suspended Sediment    --                       Water               NAPL                      Tissue  Organism
Bed           Dissolved gases                      Bed Sediment          --                       Interstitial water  NAPL                      Tissue  Organism
Soil          Soil atmosphere                      Solid Phase           Permafrost, Frozen soil  Soil water          NAPL                      Tissue  Organism
Aquifer       Dissolved gas                        Solid Phase           --                       Ground water        NAPL                      Tissue  Organism
For media, both the tissue and organism levels are included because data could be collected at either level. An example of time-series data for tissue is pesticide concentration in fish tissue collected at a certain place over time (albeit in different individuals that move around); an example for organism data is the number of individuals of a certain species in a lake over time.
The only liquid phase currently included is water. This could be expanded to include non-aqueous phase liquids (NAPLs), which are typically divided into those more dense and those less dense than water.
Although there are slight variations in the names of media in different components and of components in different environments, and there are a few holes in the matrix, these are basically independent factors describing environment, components, and media, resulting in a 5 x 6 x 6 (180 element) cube.
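The arithmetic behind the cube can be checked with a short enumeration. The sketch below rests on my own reading of the two tables: it assumes six components (counting Soil separately from Bed) and six media (leaving out NAPL, since water is the only liquid phase currently included). The list names are invented for the example, and the holes in the matrix are not removed.

```python
from itertools import product

# Assumed axis values, read off the component and media tables above.
ENVIRONMENTS = ["Ocean", "River", "Lake", "Land", "Infrastructure"]
COMPONENTS = ["Ice/Snow", "Water Column", "Bed", "Soil", "Vegetation", "Aquifer"]
MEDIA = ["Gas", "Rock/Solid Phase", "Ice", "Water", "Tissue", "Organism"]

# Every (environment, component, medium) cell of the cube, holes included.
cube = list(product(ENVIRONMENTS, COMPONENTS, MEDIA))

print(len(ENVIRONMENTS), len(COMPONENTS), len(MEDIA), len(cube))
```

Treating the three factors as independent axes is what lets a publisher choose each one separately rather than picking from a single list of 180 combined names.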
The atmosphere has no components and 4 media (gas, particulates, fog/mist, precipitation). We should consult with an atmospheric scientist to see whether fog/mist is considered a class of precipitation, but from a chemical point of view, fog chemistry is considered separately from precipitation chemistry.
Ontology Layers for Data Groupings. Within each possible combination of environment, component, and media, there are three data groupings: physical, chemical, and biological data.
Physical data are simple enough (and few enough) not to require additional layers of groupings.
Chemical data are substantially more complex. Rock/Solid Phase media have chemicals adsorbed to their surfaces and are also made up of chemicals in their matrix that are not present in the water phase. An arbitrary set of common chemical groupings is included in the current ontology: nutrients, heavy metals, priority pollutants, major weathering products, natural organic compounds, and synthetic organic compounds. Each of these can be further subdivided. We need to add citations (is Wikipedia good enough?) for these groupings.
Biological data have a small number of constituents, a reflection more of my ignorance than of the data complexity.
The primary grouping of physical/chemical/biological is viewed as exhaustive and mutually exclusive. The lower levels of grouping are not exhaustive, and hence each requires an “other” class.
Navigating the Ontology for Tagging. This structure should be logical to environmental scientists from a number of disciplines, but this assertion should be tested. The hierarchy of environment and media expands the multi-dimensional description of data that is frequently collapsed into a single variable name. This is a major source of semantic heterogeneity among data systems. One dimension we are ignoring is the units of measurement. NWIS has different parameter codes for English and metric units. Both parameters would be tagged to the same concept. The distinction is made upon downloading the data, but not at the discovery phase.
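The treatment of units can be illustrated with a small lookup table. This is a sketch only: the concept names are invented, and the NWIS parameter codes shown are the ones I believe correspond to discharge and water temperature in English and metric units. They should be checked against the NWIS parameter code list before being relied on.

```python
# Assumed mapping: NWIS parameter code -> (ontology concept, units).
# Codes are believed correct but unverified; concept names are hypothetical.
PARAMETER_TO_CONCEPT = {
    "00060": ("discharge", "ft3/s"),
    "30208": ("discharge", "m3/s"),
    "00010": ("water temperature", "deg C"),
    "00011": ("water temperature", "deg F"),
}


def concept_for(code):
    """Concept used at the discovery phase; units are deliberately ignored."""
    return PARAMETER_TO_CONCEPT[code][0]


def codes_for(concept):
    """All parameter codes tagged to a concept, regardless of units."""
    return sorted(code for code, (name, _units) in PARAMETER_TO_CONCEPT.items()
                  if name == concept)


print(concept_for("00060") == concept_for("30208"))
print(codes_for("discharge"))
```

The units stay attached to each code, so the distinction is still available when the data are downloaded.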
Using the Ontology for Discovery. We need to define the functionality required for searching. The ontology as developed makes the most critical distinctions required to orient the user: what environment do you want data from and what media are the data measured in.
The current HydroSeek search begins with a space/time box and includes a single keyword. An important decision to be made is how we handle the keyword. It would be desirable (and, I think, necessary) to be able to search across all levels of the ontology (e.g., dissolved nitrogenous nutrients (i.e., media=water) in groundwater, but not in streams, lakes, estuaries, etc.). We need to figure out whether to flatten this structure into a single keyword or to maintain some number of dimensions in the search. One approach would be a three-dimensional search: Environment/Component, Media, Chemical Data Grouping. This seems logical to me and better than flattening the search to a single dimension.
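A multi-dimensional search of this kind might look like the following sketch. It is not HydroSeek code; the Series record, the catalog entries, and the search function are all hypothetical, and Environment/Component is split into two fields for convenience.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Series:
    """Hypothetical catalog record for one time series."""
    name: str
    environment: str
    component: str
    medium: str
    grouping: str


# Made-up catalog entries, for illustration only.
CATALOG = [
    Series("Well 12 nitrate", "Land", "Aquifer", "Water", "nitrogenous nutrients"),
    Series("Gauge 7 nitrate", "River", "Water Column", "Water", "nitrogenous nutrients"),
    Series("Gauge 7 bed lead", "River", "Bed", "Rock/Solid Phase", "heavy metals"),
]


def search(catalog, **criteria):
    """Keep series matching every criterion given; omitted dimensions are unconstrained."""
    return [s for s in catalog
            if all(getattr(s, field) == value for field, value in criteria.items())]


# Dissolved nitrogenous nutrients in groundwater, but not in streams or lakes.
groundwater_n = search(CATALOG, component="Aquifer", medium="Water",
                       grouping="nitrogenous nutrients")
print([s.name for s in groundwater_n])
```

Leaving a dimension out of the call broadens the search, so a user who only knows the data grouping can still start there and narrow down afterwards.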
Michael suggests an alternative approach. We continue to search on data grouping, as now, but report back a hierarchical list with Environment-Component-Media-Data Grouping-Result. This retains the simplicity of the interface (and a single-dimensional search), but conveys the critical information about the time-series data that is needed to avoid misinterpretation.
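The hierarchical report could be sketched as follows. Again this is only an illustration under assumed names: the records, the field order, and the hierarchical_report function are hypothetical.

```python
from collections import defaultdict

# Made-up records: (environment, component, medium, data grouping, result).
RECORDS = [
    ("Land", "Aquifer", "Water", "nutrients", "Well 12 nitrate"),
    ("River", "Water Column", "Water", "nutrients", "Gauge 7 nitrate"),
    ("River", "Bed", "Rock/Solid Phase", "heavy metals", "Gauge 7 bed lead"),
]


def hierarchical_report(records, grouping):
    """Search on data grouping only, then list results under their full path."""
    tree = defaultdict(list)
    for env, comp, medium, group, result in records:
        if group == grouping:
            tree[(env, comp, medium, group)].append(result)
    lines = []
    for path in sorted(tree):
        lines.append(" > ".join(path))                        # Environment > ... > Grouping
        lines.extend("    " + result for result in tree[path])  # indented results
    return lines


for line in hierarchical_report(RECORDS, "nutrients"):
    print(line)
```

The user still types a single keyword, but each result arrives labeled with its environment, component, and media, so groundwater nitrate is not mistaken for stream nitrate.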
We also need to review the ODM structure. The variable table includes speciation (e.g., dissolved Si as SiO2), units, sample medium, data type (e.g., continuous, cumulative), is regular, time support, time units, value type/category [Dave: are these the same?] (measured, modeled), and General Category. The Sample Medium CV is very close to the proposed ontology. General Category is similar to, but not the same as, Environment. I believe the other ODM variable table elements are not needed as search criteria in the discovery phase. It might be nice to find sites that have continuous nitrate data in the water column, but perhaps that should be a separate search engine from HydroSeek.