CUAHIS-HIS
RSS

Navigation





Quick Search
»
Advanced Search »

PoweredBy

HIS Desktop Database

(With contributions from Tim Whiteaker, Dan Ames, Jon Goodall, and Jeff Horsburgh)

This page documents the current working design and design principles for the database schema that will support data caching using HIS Desktop. The purpose for the HIS Desktop database is to store and organize data obtained from web services and other local data sources.

Functional Requirements

The following functional requirements guide the development of the HIS Desktop database schema.

  1. The database will be a relational schema that is semantically similar to the Observations Data Model (ODM) and with similar naming conventions and data types.
  2. The schema will be simple enough for people to understand and program against.
    • Need specifics. 1st year grad student, recent Phd, Faculty, with no, small, significant, amount of programming, database experience.(dv)
    • is "We provide parameterized queries that users can use to get data" (dv)
  3. HIS desktop will support multiple "project" related databases. Each project will have a separate database. Projects are envisioned to be logical groupings of data. For example, the Little Bear River project, or the Great Salt Lake project, etc. HIS Desktop should be capable of easily moving data from one project database to another.
    • Multiple DB access is required to migrate data between projects.
  4. The schema will be the storage mechanism for the results of metadata queries against the HIS Central Metadata Web Services and the data that are retrieved.
    • This is a change in the some requirements/features, if I heard correctly. I thought that we were only storing information, if the data values were stored in the database (dv)
  5. The schema will be capable of being implemented in a number of different Relational Database Management Systems (RDBMSs) so that HIS Desktop can acheive platform independence. This means that we should avoid using features or datatypes that are not supported by more than one RDBMS.
    • suggestion use a database abstraction layer that allows the uses of different databases is utilized (dv)
  6. The schema will allow direct SQL access to all of its contents using SQL queries from different applications such as R, Matlab, Excel, or Visual Basic.
    • How do you get access to ESRI datasets, or datasets in NetCDF (dv)
  7. The schema will be capable of storing all information encoded in WaterML and returned by WaterOneFlow web services.
  8. The schema will store and manage the provenance of the data that it holds, comprised of the following information:
    • Where was the data obtained (e.g., which web service did it come from, or was the data loaded from a local file)?
    • How was the data loaded (e.g., did HIS Desktop get the data automatically from a web service, or did the user load it using other tools)?
    • What was the query that resulted in the data that was loaded?
    • What is the WaterML file or local file from which the data were loaded?
    • When was the data loaded into the database?
    • Has the user subscribed to the data for frequent updates?
    • When was the last time data were checked for updates?
    • When was the last time data were updated with new data?
    • What has been done to the data since it was added to the database (e.g., transformations like units conversion), what process was used to do the transformation, and when was it done?
      • forward provenance... interesting. We usually ask where did this dataset come from. We need to work at the timeseries/dataset level. (dv)
  9. The schema will be capable of storing data associated with geographic features consisting of points, lines, and polygons.
    • Indexing tables for geometries (will/will not) be included. Eg a single table that has the bounding box for all features.
    • Features that have no time series associated with them will/will not be in the database. (eg we will preload HUCs, streams etc)

Design Questions that Need to be Addressed

  • what is the structure needed to manage time series infomation in the database (requirements 4 & 8)?
  • What logical orgainzation of the data is best for the database (DataValues centric or Data Series centric) (reqs 1 & 2) ?
  • What other groupings of data (i.e., other than data series) does the database need to support? (reqs 1,2,3,4,& 8) See Data Collections in ODM for a discussion of other logical groupings that may be useful
  • What are the available physical implementations (i.e., MS SQL Server, MySQL, etc.) that could be used to acheive platform indepence for HIS Desktop? (req 5)
  • What is the form and functionality of the abstraction layer between the database and the HIS Desktop application if any?
    • None, Basic Programming practice, Data table, Object relational? (dv)
    • provides a data table model that can be used for progamming (eg ado.net; active record pattern)
    • provides an object model (subsonic, entity framework, nhibernate)
    • provides a query language
      • supports .net 3.5 framework LINQ
    • provides transactions
    • provides tools that create the DB schemas for different databases (nHibernate)
    • possible cross platform (nhibernate)
    • Is basic programming practice enough? (dv)
      • manage the database connection (was under database abstraction, but a a DB abstraction layer is more than a DB connection is the question, (dv))
      • if basic programming is chosen, it needs to be programmed to a standard to avoid SQL injection attacks. eg all queries need to be parameterized, or properly checked. (dv)
    • will the data layer fully manage the be interaction? (dv)
      • any translations for SQL queries (dv)
      • is an option for pass though queries (dv)
    • what data abstraction layer will be used for .Net programming?
  • What are the acceptable levels of performance for the database?
  • What is the design for storing provenance (other than keeping the WaterML file):
    • Do we create new data series when we do transformations or edits on existing data series?
    • Do we allow only transformations or edits at a data series level?
      • what functions should we support. (merge, replace values (have series with edits, and series to be edited), QA Filter, others (dv))
    • How do we store the definition of a transformation or edit?
    • What if we do a transformation and then want to update a data series with new data that are not transformed?
    • What, specifically, do we mean by "provenance?"
  • What information is needed to manage data "subscriptions?"
  • How do we manage sites?
    • Do we abstract sites to features that can be points, lines, or polygons (essentially get rid of the sites table in favor of a features table)?
    • Do we have points, lines, and polygons tables and store the geographic representation of the features in the database?
  • How do we enable the combination of data series that may belong together but that are separate by the ODM definition of a data series (e.g., Site, Variable, Method, Source, QualityControlLevel)
    • Consider where we have the same Site, Variable, and Method, but the source of the data changes because the source of funding for data collection changes. It is really one continuous time series.
    • Consider where we have the same Site, Variable, and Source, but the method changes because the instrument changes. It may be considered one continuous time series.
  • How do we manage the UnitConversions table, and how do we incorporate Speciation into unit conversions? Do we need to cross the Mole Bridge!?
  • Do we have a local copy of the ontology in the desktop database, and if so, how do we represent it and manage it?
    • This is a separate task... suggest only managing a concept mapping (dv)
  • How do we manage provenance of data loaded from a non-WaterOneFlow/WaterML source (e.g., local files).
  • How are downloaded data values that should be static, handled?
  • Should we enable the database to store multiple copies of the same data (e.g., multiple WaterML downloads of the same data on different dates). Even the USGS occassionally changes data values. If you do not maintain this information, the results of a model may change, and you may not have a record of why it happened.
  • How do we best include information from ODM that is not currently returned by WaterML (e.g., DerivedFrom, Data Grouping, etc.). I am putting this here because there are currently upgrades being made to the web services that we need to anticipate in the design of this database.
    • Think. If derived from is added... that means that you have to have previously have downloaded the pervious dataset for it to have meaning.


Alternatives

Two alternative designs are under consideration:
  1. Series centric
  2. Data values centric

Some pro's and con's for each are listed:

Series Centric Design Alternative

The following is the current working design of the Data Series Centric HIS Desktop database schema.

HIS Desktop Data Series Centric Database Schema Diagram

Series Centric Database Schema Visio 2007 Zip File

Pro's

  1. Direct links from catalog to data requiring simpler queries to access the data

Con's

  1. More difficult to manage provenance

Data Values Centric Design Alternative

The following is the current working design of the Data Values Centric HIS Desktop database schema.

HIS Desktop DataValues Centric Database Schema Diagram

Data Values Centric Database Schema Visio 2007 Zip File

Pro's

  1. Similar to ODM so code that works with ODM can be reused more easily, such as ODM tools and Time Series Analyst
  2. Management of provenance at the data values level is easier for cases where a single series comes from multiple GetValues calls
  3. May more easily support other queries besides series-centric queries, such as a snapshot in time across all space. (This could be tested, as Jeff was able to successfully construct various queries on a Series-Centric database for Little Bear River.)

Con's

  1. Difficulties associated with Catalog Regeneration
  2. Query performance

ScrewTurn Wiki version 3.0.1.400. Some of the icons created by FamFamFam.