CUAHIS-HIS
RSS

Navigation





Quick Search
»
Advanced Search »

PoweredBy

ODM - Best Practices - Loading Data into ODM

RSS
Modified on 2009/04/13 07:04 by twhit Categorized as ODM

Introduction

This document describes best practices for loading data into an ODM database. The recommendations reflect the experiences of both the CUAHSI team, and also the test beds making use of CUAHSI HIS.

The scope of the document is limited to general ways of organizing your data. For example, what do I use for SiteCodes, or should I be working with more than one observations network? As for the specific tool(s) you use for actually loading your data into an ODM database -- that is up to you. CUAHSI tools such as the ODDataLoader and Streaming Data Loader have documentation associated with them, and you are referred to those docs if you are ready to begin loading your data. However, if you're still confused about how your data best fits into the context of CUAHSI HIS, then this document provides some guidelines that may be of interest to you.

Resources

Observations Data Model (ODM) information and blank databases are at
http://water.usu.edu/cuahsi/odm/default.aspx

That site also has links to the ODM Tools, which are used for viewing and editing ODM data, and the Streaming Data Loader, which is designed to load data into ODM from files generated by sensors and data loggers.

The design specifications for ODM contains a some examples for how different types of data fit into the ODM. The specs are available at
http://water.usu.edu/cuahsi/odm/ODM_downloads.aspx

Another place to see examples is in the ODM Use Cases section.

ODM uses controlled vocabularies to help constrain the values that go in the database. For a list of the values, see
http://water.usu.edu/cuahsi/odm/cv.aspx

On that page, you can also propose changed or additions to the controlled vocabularies. These will be reviewed by the CUAHSI team, and the controlled vocabularies will be updated if deemed appropriate.

Another great resource is YOU. You can provide feedback, tips, and example files at the WATERS Testbed Data Managers Google group at
http://groups.google.com/group/waters-testbed-data-managers

Preparing Site Information

Assigning SiteCodes

SiteCode is a short, unique text identifier for a site, where a site is defined as a single record in the Sites table of the Observations Data Model. Both its uniqueness and its conciseness lend it towards performing queries on the data, as opposed to using the long name for the site.

Example:
I have a site named “USGS NWIS Stream Gage at the Colorado River at Austin”. This site has a SiteCode of “08158000”.

Make Them Unique within an ODM Database

Each site (e.g., a stream gage) is assigned a SiteCode that must be unique within your ODM database. SiteCode is used by various CUAHSI tools to locate a specific site within ODM, so that’s why it must be unique.

Example:
I operate a water quality observations network, that I’ll call UTdata. Within this network, I have two sensors on the Colorado River at different locations. I would NOT give both of these sites a SiteCode of “UTSite”. Instead, I might call one “UTSite1” and the other “UTSite2”.

Typically, each instance of an ODM database represents a single observation network. It’s ok for sites in different ODM databases to have the same SiteCode. These sites are differentiated by the observations network to which they belong.

Example:
I have a database of NWIS sites. The SiteCode for a stream gage in Austin is “08158000”. I also have my own water quality sensor that I have installed next to the USGS sensor, to which I had previously assigned a SiteCode of “UTSite1”. As a reminder that my sensor is at the same location as the USGS sensor, I give my sensor a SiteCode of “08158000”. This is ok, because the USGS site and my own site are in different observations networks, and in different ODM databases. CUAHSI tools which query an ODM database, such as DASH or the Excel HydroObjects sample spreadsheet that some of you may have seen, report the network along with the SiteCode, so it’s easy to keep track of which observations network you’re looking at.

Note that in this example, I could also just as easily have stuck with “UTSite1” for this site. The choice about which SiteCodes you assign is up to you, just so long as they are unique within your ODM database.

Make Them Concise

When we query for these data, we might be typing in these queries by hand. Typing in very long SiteCodes can be tedious, so I recommend keeping them brief.

Important: For your codes, try to avoid using special characters that might be used by other applications to parse text, or which might be altered or removed by web applications. Example of characters to avoid include commas, tildes, semicolons, plus signs, spaces, and colons.

Example:
“Colorado River, Texas: River Mile 41” is not a very good SiteCode because it is long and has some special characters in it. A better SiteCode is “CoRiver41”.

Preparing Variable Information

Assigning VariableCodes

VariableCode is a short, unique identifier for a variable, where a variable is defined as a single record within the Variables table of the Observations Data Model. Both its uniqueness and its conciseness lend it towards performing queries on the data, as opposed to using the long name for the variable.

Example:
I have a variable named “Carbon dioxide, transducer signal”. I have assigned it a VariableCode of “CO2sig”.

Make Them Unique within an ODM Database

Each variable is assigned a VariableCode that must be unique within your ODM database. VariableCode is used by various CUAHSI tools to locate a specific variable within ODM, so that’s why it must be unique.

Example:
My sensor reports minimum and maximum temperature for each reading. I have two variables named “Temperature” in my database, where one variable has a DataType of “Maximum” and the other has a DataType of “Minimum”. Because VariableCode is the field that is searched on by various CUAHSI tools, I must assign a code that also differentiates these two variables. I assign one variable a code of “TempMax”, and the other “TempMin”.

It’s ok for variables in different ODM databases to have the same VariableCode. This will not break any of the CUAHSI tools which operate on ODM databases. Just make sure that within a given ODM database, the VariableCode are unique for each record in the Variables table.

Make Them Concise

When we query for these data, we might be typing in these queries by hand. Typing in very long VariableCode can be tedious, so I recommend keeping them brief.

Important: For your codes, try to avoid using special characters that might be used by other applications to parse text, or which might be altered or removed by web applications. Example of characters to avoid include commas, tildes, semicolons, plus signs, spaces, and colons.

Assigning Variable Names, Units, and other Info

Before you load your data into an ODM database, have a look at the VariableNameCV table in the database. A set of suggested variable names has already been provided for you. You’ll see other suggested items in the other tables that end with “CV”, which stands for “controlled vocabulary”. The Units table also follows a controlled vocabulary.

It’s best to use the items from these controlled vocabularies to describe your data. This makes it much, much easier to relate your data to other datasets that are out there. That way, if someone is looking at a Temperature variable in your database and a Temperature variable in other ODM database from another university, they can have confidence that they are looking at the same thing.

The CUAHSI controlled vocabularies are documented at
http://water.usu.edu/cuahsi/odm/cv.aspx

You can click on a vocabulary to view it, and even propose changes and additions to the vocabularies.

Also, note that if you use the ODDataLoader to load your data, it tries to match the units from your input file, to units in the target ODM database. If your input file has units that aren’t in the database, then the ODDataLoader may assign the incorrect units to your data. Thus, it’s best to use CUAHSI controlled vocabulary units for your data.

The ODM Tools includes a function that allows you to reconcile your ODM CV's with the master set, which may change as people propose their own additions to the vocabularies. This helps you update your ODM with new CV items that have been added to the master set, and also identifies items that you have added to your own database which are not in the master set.

Publishing Multiple Observations Networks

Suppose you have data from several different observation networks (perhaps three different research projects at your university), and you want to publish all the data on your HIS Server. Do you load all the data into one ODM database? Do you create one web service to run on top of the data? Or do you split things up into separate networks.

The answer depends on your data and your own preferences. So I will just provide some things to consider here.

Modularity vs. Client Convenience

The advantage of loading every network into its own database, is that you keep each database free from problems that may arise due to other databases in the system. This is like a modular approach. For example, if I realize that I loaded all of my groundwater data incorrectly, I can just wipe out that entire database and reload it. However, if the groundwater data were sitting in a database with data from other networks, then I wouldn’t want to wipe out that database, because I would have to reload all of the data from the other network which may not have anything wrong with it. I’d have to selectively remove the groundwater records and reload them, while being careful not to mess up anything related to my other networks in the database.

The disadvantage of this approach, is that if you just want to “get at the data hosted by my university”, and you have seven different observations networks, you’ll have to read seven different ODM databases, or access seven different web services, or query seven different layers in DASH.

How many sites do you have?

If you just have 10 or 20 sites in each network, then it probably wouldn’t hurt to load all of the data into a single ODM database.

Is the character of the data similar?

If both of your observations networks are measuring the same variables, or have other such commonalities, then it probably makes sense to put them in the same database.

Updating the ODM Database

Updating Controlled Vocabularies

The ODM Tools provide functionality for updating controlled vocabularies. The tools and documentation are available at
http://water.usu.edu/cuahsi/odm/ODM_downloads.aspx

If you want to update your databases yourself, you can get the latest controlled vocabularies from the web service at
http://water.usu.edu/odmcv/odmcv.asmx

If you need to have an item, which is not part of the controlled vocabularies, then you should view, edit, or create a new controlled vocabulary item at
http://water.usu.edu/cuahsi/odm/cv.aspx

Once a controlled vocabulary item has been submitted, the HIS team will review it, and add it if they approve. Then, you can incorporate the new item into your ODM databases by using the ODM Tools.

Updating Values

For data coming in from sensors, the Streaming Data Loader is a great tool that can be run periodically to update your database. Download at
http://water.usu.edu/cuahsi/odm/ODM_downloads.aspx

If you need another tool, you could try the ODDataLoader, available at
http://water.sdsc.edu/ODDataloader/

The ODDataLoader reads data from flat files (text files or Excel worksheets) and loads the data into an ODM database.

The third option is SQL Server Integration Services (SSIS). CUAHSI does not provide formal SSIS tutorials to assist with data loading. However, the team at Utah State University did experiment with loading data from the Sevilleta Global Change Experiment into ODM with SSIS. Sample scripts and a presentation can be found at
http://his02.usu.edu/sevilleta/BasicData.htm

Dates and Times

UTC and Local Times

Unambiguous interpretation of date and time information requires specification of the time zone or offset from universal time (UTC). A UTCOffset field is included in the DataValues table to ensure that local times recorded in the database can be referenced to standard time and to enable comparison of results across databases that may store data values collected in different time zones (e.g. compare data values from one hydrologic observatory to those collected at another hydrologic observatory located across the country). UTCOffset is a record level qualifier because even though the time zone, and hence offset, is likely the same for all measurements at a site, the offset may change due to daylight savings. Some investigators may run data loggers on UTC time, while others may use local time adjusting for daylight saving time. To avoid the necessity to keep track of the system used, or impose a system that might be cumbersome and lead to errors, we decided that if the offset was always recorded, the precise time would be unambiguous and would reduce the chance for interpretation errors. A field DateTimeUTC is also included as a record level attribute associated with each data value. This provides a consistent time for querying and sorting data values. There is a level of redundancy between LocalDateTime, UTCOffset and DateTimeUTC. Only two are required to calculate the third. For simplicity and clarity we retain all three. ODM data loaders should only require two of the quantities to be input and should then calculate the third.

Best Practice for Recording Times At the Sensor

Even though it may be known that a particular sensor network switches from daylight savings time to standard time, it's possible that the operator may not have switched the data logger to standard time at exactly the right time of day, leading to errors if a blanket calculation of UTCOffset is applied using "official" dates and times for the switch to standard time.

To avoid these problems, we recommend that all data loggers be set to local standard time and record in that time even during periods of daylight savings time. It's much easier to adopt this convention at the source, rather than try to compute the correct UTCOffset afterward.

Beginning of Interval Reporting Time for Interval Data Values

Some data types used in ODM apply to data values that occur over an interval of time. The date and time reported and entered in to the ODM database associated with each interval data value should be the beginning time of the observation interval. This convention was adopted to be consistent with the way dates and times are represented in most common database management systems. It should be noted that using the beginning of the interval is not consistent with the time a data logger would log an observation value. Care should be exercised in adding data to the ODM to ensure that the beginning of interval convention is followed.

This beginning of interval reporting convention applies regardless of the actual dates or time zone. For example given daily values recorded in a time zone with UTCOffset of -6 a consistent set of time specifiers is:
LocalDateTime: 1987-03-12T00:00:00
UTCOffset: -6
UTCDateTime: 1987-03-11T18:00:00
Or for a more unusual case
LocalDateTime: 1987-01-01T00:00:00
UTCOffset: -6
UTCDateTime: 1986-12-31T18:00:00

Frequent Asked Questions (FAQ)

What units do I use for pH

You might be tempted to use the controlled vocabulary item called "dimensionless", but since pH is a special case, a dimensionless controlled vocabulary item called "pH Unit" has been created for that purpose.

ScrewTurn Wiki version 3.0.1.400. Some of the icons created by FamFamFam.