HIS Desktop Database
(With contributions from Tim Whiteaker, Dan Ames, Jon Goodall, and Jeff Horsburgh)
This page documents the current working design and design principles for the database schema that will support data caching using HIS Desktop. The purpose for the HIS Desktop database is to store and organize data obtained from web services and other local data sources.
Functional Requirements
The following functional requirements guide the development of the HIS Desktop database schema.
- The database will be a relational schema that is semantically similar to the Observations Data Model (ODM) and with similar naming conventions and data types.
- The schema will be simple enough for people to understand and program against.
Need specifics. 1st year grad student, recent Phd, Faculty, with no, small, significant, amount of programming, database experience.(dv)
- is "We provide parameterized queries that users can use to get data" (dv)
- HIS desktop will support multiple "project" related databases. Each project will have a separate database. Projects are envisioned to be logical groupings of data. For example, the Little Bear River project, or the Great Salt Lake project, etc. HIS Desktop should be capable of easily moving data from one project database to another.
- Multiple DB access is required to migrate data between projects.
- The schema will be the storage mechanism for the results of metadata queries against the HIS Central Metadata Web Services and the data that are retrieved.
- The schema will be capable of being implemented in a number of different Relational Database Management Systems (RDBMSs) so that HIS Desktop can acheive platform independence. This means that we should avoid using features or datatypes that are not supported by more than one RDBMS.
- The schema will allow direct SQL access to all of its contents using SQL queries from different applications such as R, Matlab, Excel, or Visual Basic.
- The schema will be capable of storing all information encoded in WaterML and returned by WaterOneFlow web services.
- The schema will store and manage the provenance of the data that it holds, comprised of the following information:
- Where was the data obtained (e.g., which web service did it come from, or was the data loaded from a local file)?
- How was the data loaded (e.g., did HIS Desktop get the data automatically from a web service, or did the user load it using other tools)?
- What was the query that resulted in the data that was loaded?
- What is the WaterML file or local file from which the data were loaded?
- When was the data loaded into the database?
- Has the user subscribed to the data for frequent updates?
- When was the last time data were checked for updates?
- When was the last time data were updated with new data?
- What has been done to the data since it was added to the database (e.g., transformations like units conversion), what process was used to do the transformation, and when was it done?
- The schema will be capable of storing data associated with geographic features consisting of points, lines, and polygons.
- Indexing tables for geometries (will/will not) be included. Eg a single table that has the bounding box for all features.
- Features that have no time series associated with them will/will not be in the database. (eg we will preload HUCs, streams etc)
Design Questions that Need to be Addressed
- what is the structure needed to manage time series infomation in the database (requirements 4 & 8)?
- What logical orgainzation of the data is best for the database (DataValues centric or Data Series centric) (reqs 1 & 2) ?
- What other groupings of data (i.e., other than data series) does the database need to support? (reqs 1,2,3,4,& 8) See Data Collections in ODM for a discussion of other logical groupings that may be useful
- What are the available physical implementations (i.e., MS SQL Server, MySQL, etc.) that could be used to acheive platform indepence for HIS Desktop? (req 5)
- What is the form and functionality of the abstraction layer between the database and the HIS Desktop application if any?
- None, Basic Programming practice, Data table, Object relational? (dv)
- provides a data table model that can be used for progamming (eg ado.net; active record pattern)
- provides an object model (subsonic, entity framework, nhibernate)
- provides a query language
- supports .net 3.5 framework LINQ
- provides transactions
- provides tools that create the DB schemas for different databases (nHibernate)
- possible cross platform (nhibernate)
- Is basic programming practice enough? (dv)
- manage the database connection (was under database abstraction, but a a DB abstraction layer is more than a DB connection is the question, (dv))
- if basic programming is chosen, it needs to be programmed to a standard to avoid SQL injection attacks. eg all queries need to be parameterized, or properly checked. (dv)
- will the data layer fully manage the be interaction? (dv)
- any translations for SQL queries (dv)
- is an option for pass though queries (dv)
- what data abstraction layer will be used for .Net programming?
- What are the acceptable levels of performance for the database?
- What is the design for storing provenance (other than keeping the WaterML file):
- Do we create new data series when we do transformations or edits on existing data series?
- Do we allow only transformations or edits at a data series level?
- what functions should we support. (merge, replace values (have series with edits, and series to be edited), QA Filter, others (dv))
- How do we store the definition of a transformation or edit?
- What if we do a transformation and then want to update a data series with new data that are not transformed?
- What, specifically, do we mean by "provenance?"
- What information is needed to manage data "subscriptions?"
- How do we manage sites?
- Do we abstract sites to features that can be points, lines, or polygons (essentially get rid of the sites table in favor of a features table)?
- Do we have points, lines, and polygons tables and store the geographic representation of the features in the database?
- How do we enable the combination of data series that may belong together but that are separate by the ODM definition of a data series (e.g., Site, Variable, Method, Source, QualityControlLevel)
- Consider where we have the same Site, Variable, and Method, but the source of the data changes because the source of funding for data collection changes. It is really one continuous time series.
- Consider where we have the same Site, Variable, and Source, but the method changes because the instrument changes. It may be considered one continuous time series.
- How do we manage the UnitConversions table, and how do we incorporate Speciation into unit conversions? Do we need to cross the Mole Bridge!?
- Do we have a local copy of the ontology in the desktop database, and if so, how do we represent it and manage it?
- This is a separate task... suggest only managing a concept mapping (dv)
- How do we manage provenance of data loaded from a non-WaterOneFlow/WaterML source (e.g., local files).
- How are downloaded data values that should be static, handled?
- Should we enable the database to store multiple copies of the same data (e.g., multiple WaterML downloads of the same data on different dates). Even the USGS occassionally changes data values. If you do not maintain this information, the results of a model may change, and you may not have a record of why it happened.
- How do we best include information from ODM that is not currently returned by WaterML (e.g., DerivedFrom, Data Grouping, etc.). I am putting this here because there are currently upgrades being made to the web services that we need to anticipate in the design of this database.
Alternatives
Two alternative designs are under consideration:
- Series centric
- Data values centric
Some pro's and con's for each are listed:
Series Centric Design Alternative
The following is the current working design of the Data Series Centric HIS Desktop database schema.
Series Centric Database Schema Visio 2007 Zip FilePro's
- Direct links from catalog to data requiring simpler queries to access the data
Con's
- More difficult to manage provenance
Data Values Centric Design Alternative
The following is the current working design of the Data Values Centric HIS Desktop database schema.
Data Values Centric Database Schema Visio 2007 Zip File
Pro's
- Similar to ODM so code that works with ODM can be reused more easily, such as ODM tools and Time Series Analyst
- Management of provenance at the data values level is easier for cases where a single series comes from multiple GetValues calls
- May more easily support other queries besides series-centric queries, such as a snapshot in time across all space. (This could be tested, as Jeff was able to successfully construct various queries on a Series-Centric database for Little Bear River.)
Con's
- Difficulties associated with Catalog Regeneration
- Query performance