There has been discussion within the HIS team on ODM variable codes and the best approach to mapping information between ODM and the ontology. Both the need for manually mapping variables onto the ontology and the process of doing so can cause confusion, and we should seek a rule-based, largely automatic solution. The issue is not so much that there is a lot of data to process (i.e. to tag), but that the people being asked to do the tagging do not have a thorough appreciation of how what they do helps the system. This document is an effort to summarize some thinking about this and suggest some options for the path forward.
The issues are:
- How should concepts in the ontology be mapped onto variables in the variables table?
- Should a set of standardized variable codes be developed and if so what should this be?
Background information.
1. The ODM 1.1 functional specification and schema are on the HIS Website at
http://his.cuahsi.org/odmdatabases.html.
2. Bryan Enslein has written a term paper for a University of Texas class that suggests a set of standardized variable codes. This is at
https://webspace.utexas.edu/bje356/TheODM.mht
3. Documents giving the functional specifications for HydroSeek and HydroTagger are on the HIS website at
http://his.cuahsi.org/hydroseek.html and
http://his.cuahsi.org/hydrotagger.html
The ODM 1.1 schema lists the following content fields for the Variables table.
- VariableName – Name of the variable
- SampleMedium – medium in which the sample or observation was made
- Speciation – chemical species used to identify how the value is expressed, e.g. "total phosphorus concentration" (name) as "P" (speciation)
- ValueType – Field observation, Laboratory observation etc.
- GeneralCategory – Climate, Water Quality, etc.
- Units (encoded as an ID that references the Units table)
- Datatype – Average, maximum, minimum etc.
- Timesupport
- Timesupportunits (encoded as an ID that references the Units table)
- IsRegular (True/False)
- NodataValue
These have been listed in order according to their utility (in my judgment) for data discovery.
Note that the fields VariableID and VariableCode have been omitted from this list because at present they do not contain content useful for data discovery. This is certainly the case for VariableID, but is debatable for VariableCode, which was originally defined as "Text code used by the organization that collects the data to identify the variable". Under that definition VariableCode would have been content, but implementation and practice have resulted in VariableCode becoming a "unique identifier used by the web services to identify variables". One of the issues with making variable codes unique is that each code must then compress all the information in the full set of fields that comprise a variable record into a single value.
Concept mapping. This is also called tagging.
Concept mapping involves associating leaf concepts in the CUAHSI HIS ontology with Variables in ODM to facilitate the data discovery functionality of HydroSeek. An issue with this tagging is the level of granularity of information that is represented in the ontology and the level of aggregation of ODM variables that are mapped onto ontology leaf level concepts.
The following 4 options are possible:
1. Tag on VariableName. This is perhaps the simplest.
1 leaf concept -> Zero or more VariableNames -> One or more variable records (identified in terms of their ID or Code)
2. Tag on unique VariableName, SampleMedium.
1 leaf concept -> Zero or more unique VariableName/SampleMedium combinations -> One or more variable records (identified in terms of their ID or Code)
3. Tag on unique VariableName, SampleMedium, Speciation.
1 leaf concept -> Zero or more unique VariableName/SampleMedium/Speciation combinations -> One or more variable records (identified in terms of their ID or Code)
4. Tag on each unique variable in terms of all fields.
1 leaf concept -> Zero or more variable records (identified in terms of their ID or Code)
Clearly there are more combinations. These simply illustrate the range of possibilities. I do not think it makes much sense to think of more than the first 5 Variables fields listed above as being useful for data discovery. Option 4 is the option currently used by HydroTagger and HydroSeek, although I have argued for the simplicity of option 1.
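To make the difference between the four options concrete, here is a minimal Python sketch that groups a handful of variable records into the units that would each need one tag. The sample records, the `TAG_FIELDS` table and the `tagging_units` function are invented for illustration and are not part of ODM, HydroTagger or HydroSeek; only the field names come from the list above.

```python
from collections import defaultdict

# Hypothetical sample of Variables records (values invented for illustration).
VARIABLES = [
    {"VariableID": 1, "VariableName": "Streamflow", "SampleMedium": "Water",
     "Speciation": "Not Applicable", "Units": "m3/s"},
    {"VariableID": 2, "VariableName": "Streamflow", "SampleMedium": "Water",
     "Speciation": "Not Applicable", "Units": "ft3/s"},
    {"VariableID": 3, "VariableName": "Phosphorus, total", "SampleMedium": "Water",
     "Speciation": "P", "Units": "mg/L"},
    {"VariableID": 4, "VariableName": "Phosphorus, total", "SampleMedium": "Sediment",
     "Speciation": "P", "Units": "mg/kg"},
]

# Fields that make up the tagging key for options 1 to 3.
TAG_FIELDS = {
    1: ("VariableName",),
    2: ("VariableName", "SampleMedium"),
    3: ("VariableName", "SampleMedium", "Speciation"),
}


def tag_key(record, option):
    """Return the tuple of field values that identifies one tagging unit."""
    if option == 4:
        # Option 4: every content field counts, so only VariableID is left out.
        fields = tuple(f for f in record if f != "VariableID")
    else:
        fields = TAG_FIELDS[option]
    return tuple(record[f] for f in fields)


def tagging_units(variables, option):
    """Group VariableIDs by tagging key; each key needs exactly one tag."""
    groups = defaultdict(list)
    for record in variables:
        groups[tag_key(record, option)].append(record["VariableID"])
    return dict(groups)


for option in (1, 2, 3, 4):
    print(option, len(tagging_units(VARIABLES, option)))
```

With these four records, option 1 needs two tags, options 2 and 3 need three, and option 4 needs four, one per record. The number of tags a data manager has to assign grows with the number of fields in the key.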
Standardization of Variable Codes
Enslein and others have argued that variable codes need to be standardized.
Reasons for standardization of variable codes
- A data manager could just use CUAHSI's standard system and not have to develop their own coding system
- Standard variable codes could be pre-mapped onto the concept ontology eliminating some of the effort (and subjectivity and education about ontologies) required in mapping variables to the ontology
Reasons for not standardizing variable codes
- Assigning and maintaining a consistent set of standardized codes across all combinations of variable fields across a federated set of ODMs is inherently intractable
- The specific standardized variable code scheme proposed by Enslein has shortcomings. Even just looking at variable names there are problems of ambiguity that I point out below. Enslein's scheme does not address how to deal with differences in the other ODM Variables table fields in a comprehensive way.
- As presently implemented (i.e. with the requirement that variable code should be unique), a variable code has to be unique across every combination of values of the eleven content fields in the Variables table. Unless a coding scheme handles this unambiguously, automatic mapping to the ontology will not work.
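The uniqueness requirement in the last point can be checked mechanically for any candidate coding scheme. The sketch below is only an illustration: `find_code_collisions`, the two toy schemes and the sample records are hypothetical, and the records carry only three of the eleven fields to keep the example short.

```python
from collections import defaultdict

# Hypothetical variable records (a subset of the ODM fields, invented values).
RECORDS = [
    {"VariableName": "Streamflow", "SampleMedium": "Water", "Units": "m3/s"},
    {"VariableName": "Streamflow", "SampleMedium": "Water", "Units": "ft3/s"},
    {"VariableName": "Temperature", "SampleMedium": "Air", "Units": "degC"},
]


def find_code_collisions(records, make_code):
    """Return {code: [records]} for every code shared by more than one record."""
    by_code = defaultdict(list)
    for record in records:
        by_code[make_code(record)].append(record)
    return {code: recs for code, recs in by_code.items() if len(recs) > 1}


def name_only_code(record):
    """Toy scheme: the code depends on VariableName alone."""
    return record["VariableName"].upper()


def name_units_code(record):
    """Toy scheme: the code depends on VariableName and Units."""
    return f'{record["VariableName"]}|{record["Units"]}'.upper()


print(find_code_collisions(RECORDS, name_only_code))   # two Streamflow records collide
print(find_code_collisions(RECORDS, name_units_code))  # no collisions in this sample
```

A scheme passes only if the returned dictionary is empty for every set of records it may ever meet, which is the difficulty with requiring uniqueness across all fields.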
Recommendation
The purposes and potential purposes for variable code are:
1. To provide a unique identifier for web services to call corresponding variable records from the DataValues table.
2. To provide a unique identifier to use in mapping on to the ontology
3. To record the code used by the collecting organization (such as USGS 00060 for streamflow) where this has meaning in attributing the data.
Purpose 3 has been lost with our efforts to pursue 1 and 2.
My suggestion is that we do adopt an automatic variable coding system, but that we relax the requirement that codes be unique. I suggest instead that we decide the degree of granularity that is to be associated with leaf concepts in the ontology and then make variable codes unique only at this level of granularity. I suggest that option 3 above be chosen as the level of granularity for mapping onto the ontology. This means that each unique combination of VariableName, SampleMedium and Speciation is assigned a variable code that is mapped onto the ontology. An automated scheme could (possibly) be devised for developing codes based on these fields.
This also means that Variables with the same VariableName, SampleMedium and Speciation, but that differ in one of the other Variables fields (e.g. units or time support), would get the same variable code. Variable codes would thus no longer be unique across variable records. Web service calls that specify a variable code to identify the required variable should then be written to return all variables with the specified variable code. For example, if "Streamflow", "Water" and "Not Applicable" are specified for the three fields, the web service would return time series that include data values in both m3/s and ft3/s units. If a user wanted only the data from the series with specific units, then units would need to be specified in the web service call. This is consistent with the way web services work now, where for example if the call getvariables(' ') is made all variables are returned, but if getvariables(variablecodelist) is made only variables on the list are returned.
This suggestion would fully satisfy purpose 2 for variable codes. Purpose 1 would be lost, but with the benefit that data that is similar in the sense of being in the same place in the ontology, but different, say, in units, would also be returned. Users may benefit from this.
Purpose 3 would be lost, except to the extent that collecting organization codes are unique on VariableName, SampleMedium and Speciation, in which case they could be used as optional replacements for an automatically generated code from a coding scheme.
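The recommendation can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the code format produced by `make_variable_code`, the `get_variables` function (which only mimics the behaviour described above and is not the actual web service), and the sample records. This particular code format is also not guaranteed to keep different field values distinct; a real scheme would need to address that.

```python
import re

# Hypothetical variable records (invented values, subset of the ODM fields).
VARIABLES = [
    {"VariableName": "Streamflow", "SampleMedium": "Water",
     "Speciation": "Not Applicable", "Units": "m3/s"},
    {"VariableName": "Streamflow", "SampleMedium": "Water",
     "Speciation": "Not Applicable", "Units": "ft3/s"},
    {"VariableName": "Phosphorus, total", "SampleMedium": "Water",
     "Speciation": "P", "Units": "mg/L"},
]


def make_variable_code(name, medium, speciation):
    """Build a code from the three option-3 fields (hypothetical format)."""
    def slug(text):
        # Capitalize each word, then drop everything that is not a letter or digit.
        return re.sub(r"[^A-Za-z0-9]+", "", text.title())
    return ".".join(slug(part) for part in (name, medium, speciation))


def get_variables(variables, code=None, units=None):
    """Return all records matching the code; optionally narrow by units.

    With no code given, all variables are returned.
    """
    matches = []
    for v in variables:
        v_code = make_variable_code(
            v["VariableName"], v["SampleMedium"], v["Speciation"])
        if code is not None and v_code != code:
            continue
        if units is not None and v["Units"] != units:
            continue
        matches.append(v)
    return matches


code = make_variable_code("Streamflow", "Water", "Not Applicable")
print(code)
print([v["Units"] for v in get_variables(VARIABLES, code)])
print([v["Units"] for v in get_variables(VARIABLES, code, units="m3/s")])
```

Here the single code for streamflow in water returns both the m3/s and the ft3/s series, and adding the units argument narrows the result to one. A collecting organization's own code could be substituted for the generated one wherever it is unique on the same three fields.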