This has two parts, the first a discussion from an evaluation of ODM from the Austrialan Bureau of Meteorology.
The second is suggestions on how to do a performance evaluation on ODM.
AU BOM
The
Cuashi HIS Data Loading document discusses loading 167 million records in ODM. Discusses performance issues, bottle necks, possible solutions to the bottlenecks, and implications to the solutions.
They proposed the solution
Data Loading Proposal Highlights
Slow Data Loading on large DataValues table
that there was a progressive slow-down in processing speed, from 5,700 records / second in a small trial, slowing to 2,700 records / second after 33 million inserts, then on to a paltry 900 records / second over the last 50 million records to make up the ~83 million of post-2000 data values.
Long Custered index regeneration
Clustered indexes are used on the data values table.
| Re-build clustered index 3: | 20 minutes | 6 million | 5,000 |
| measured <10 hours (estimate 9.5) | | 167 million | 4,700 |
Suggested Methods to Test ODM
NOTES on ODM Schema samples and Performance Testing (by DaveV):
ODM schema samples and validation cases¶
Small example sets with predicable results
Number of sites:
Number of variables:
Number of series per site, or per siteCode counts
Number of data values per series, or per series counts
- Examples E.1, E.2, E.3, E.4, E.5
- Small example sets, two to 4 stations, one or two variables, <1000 values with predicable results
- Site with no series or data values
- What behavior do we expect to see
- Variable differentiation
- We need to document which variables are similar so that users can identify similar variables across data sources, even though the variableCodes are different. We also have examples where VariableCode is often the same across a set of data (eg USGS), and other columns need to be used to differentiation them (the variable vocabulary the may be different).
- Provide at least 4 example sets of similar variables, where a set is a combinations of units, datatype, valuetype, time support, nodatavalue, sample medium, speciation, quality control level and other characteristics that represent examples where variable retrieval needs to be differentiated.
- Sets need to help document What is the expected matching behavior for ‘unknown’
- Within a set, provide instructions on which variables rows are the same, and which are different.
- For the future services design specification and metadata catalog, which ones should be retrieved as the equivalent.
- Time Support and Data Type
- Additional examples for combinations discussed in ODM design specification
- Examples that utilize the no data value
- Example that demonstrated, sample identifier, sample
- One site, one sample, one data value
- One site, one sample, multiple data values derived from sample, and associated series records
- Derived data sample
- One site, 1 variable, 1 derived variable
- Three sites, n <3 variables, derived variables
- Multiple instruments measuring same at same site
- Other examples as determined by use cases
- Bad Data (make sure these are not possible)
- Data values with no site and/or variable
- Data values associated with multiple series
- Dynamic test cases.
- Start with existing DB, load additional information, check that information is in DB
Client output
- Examples of properly using Time Support to output precision of information
- How to handle min max daily statistical observations where the time is important
- Excel output
- If time support is day, output no time
- If time support is hour, , output precision to hour
- If time support is month, output to precision to min
- Example on how to properly handle no data value
ODM performance testing
What is the metric? Database size, data load performance, data retrieval performance?
Suggested scenarios
- Y Sites, N Series, O data values
- 1 site, 1 series, YxNxO data values
- Third and foruth intermediate scenarios
Single Stream Performance:
- Data load performance for various data spacing.
- load data, for above scenarios
- 30 seconds
- 10 seconds
- 5 seconds
- 2 seconds
Performance for loading multiple streams
- Data load performance for various data spacing.
- load data, for above scenarios
- 30 seconds
- 10 seconds
- 5 seconds
- 2 seconds
Data retrieval performance
- Retrieve data from above scenarios
- retrieve data while loading data
ODM Performance Testing Observations over EPA Catalog
The EPA Catalog data was used to generate a series catalog from the data.
ODM SeriesCatalog Query
SELECT SiteID, VariableID, MethodID, QualityControlLevelID, SourceID, MIN(LocalDateTime) AS BeginDateTime,
MAX(LocalDateTime) AS EndDateTime, MIN(DateTimeUTC) AS BeginDateTimeUTC, MAX(DateTimeUTC) AS EndDateTimeUTC,
COUNT(DataValue) AS ValueCount
FROM DataValues
GROUP BY SiteID, VariableID, MethodID, QualityControlLevelID, SourceID) dv
INNER JOIN dbo.Sites s ON dv.SiteID = s.SiteID
INNER JOIN dbo.Variables v ON dv.VariableID = v.VariableID
INNER JOIN dbo.Units u ON v.VariableUnitsID = u.UnitsID
INNER JOIN dbo.Methods m ON dv.MethodID = m.MethodID
INNER JOIN dbo.Units u1 ON v.TimeUnitsID = u1.UnitsID
INNER JOIN dbo.Sources so ON dv.SourceID = so.SourceID
INNER JOIN dbo.QualityControlLevels qc ON dv.QualityControlLevelID = qc.QualityControlLevelID
GROUP BY dv.SiteID, s.SiteCode, s.SiteName, dv.VariableID, v.VariableCode, v.VariableName, v.Speciation,
v.VariableUnitsID, u.UnitsName, v.SampleMedium, v.ValueType, v.TimeSupport, v.TimeUnitsID, u1.UnitsName,
v.DataType, v.GeneralCategory, dv.MethodID, m.MethodDescription, dv.SourceID, so.Organization,
so.SourceDescription, so.Citation, dv.QualityControlLevelID, qc.QualityControlLevelCode, dv.BeginDateTime,
dv.EndDateTime, dv.BeginDateTimeUTC, dv.EndDateTimeUTC, dv.ValueCount
ORDER BY dv.SiteID, dv.VariableID, v.VariableUnitsID
ODM SeriesCatalog generation on EPA Storet
EPA Storet Regular Results Table
- 57 million rows
- 3.78 million "series"
- Query similar to the standard ODM series catalog took 15 hours to generate 105,364 rows for a series
- need to table scan, since the min/max functions are in there
Original query
SELECT TOP (100) PERCENT FK_STATION, FK_CHAR, FK_ACT_MEDIUM, COUNT_BIG(PK_ISN) AS ValueCount, MIN(ACTIVITY_START_DATE_TIME)
AS BeginDateTime, MAX(ACTIVITY_START_DATE_TIME) AS EndDateTime
FROM dbo.FA_REGULAR_RESULT
GROUP BY FK_CHAR, FK_ACT_MEDIUM, FK_STATION
HAVING (FK_CHAR IS NOT NULL) AND (FK_STATION IS NOT NULL)
New query First pass:
SELECT TOP (100) PERCENT FK_STATION, FK_CHAR, FK_ACT_MEDIUM, COUNT_BIG(PK_ISN) AS ValueCount
FROM dbo.FA_REGULAR_RESULT
GROUP BY FK_CHAR, FK_ACT_MEDIUM, FK_STATION
HAVING (FK_CHAR IS NOT NULL) AND (FK_STATION IS NOT NULL)
Then have a view that will
SELECT epa_distinctView_no_dates.FK_STATION, epa_distinctView_no_dates.FK_CHAR, epa_distinctView_no_dates.FK_ACT_MEDIUM,
epa_distinctView_no_dates.ValueCount, MIN(FA_REGULAR_RESULT.ACTIVITY_START_DATE_TIME) AS Expr1
FROM epa_distinctView_no_dates LEFT OUTER JOIN
FA_REGULAR_RESULT ON epa_distinctView_no_dates.FK_STATION = FA_REGULAR_RESULT.FK_STATION AND
epa_distinctView_no_dates.FK_CHAR = FA_REGULAR_RESULT.FK_CHAR AND
epa_distinctView_no_dates.FK_ACT_MEDIUM = FA_REGULAR_RESULT.FK_ACT_MEDIUM
where epa_distinctView_no_dates.FK_STATION = 108167
and epa_distinctView_no_dates.FK_CHAR= 1
and epa_distinctView_no_dates.FK_ACT_MEDIUM = 1
GROUP BY epa_distinctView_no_dates.FK_STATION, epa_distinctView_no_dates.FK_CHAR, epa_distinctView_no_dates.FK_ACT_MEDIUM,
epa_distinctView_no_dates.ValueCount
Adding min/max just kill the query speed
SELECT UniqueSeries.FK_STATION, UniqueSeries.FK_CHAR, UniqueSeries.FK_ACT_MEDIUM, dbo.epa_odm_variables.VariableName,
dbo.epa_odm_variables.SampleMedium, dbo.epa_odm_variables.VariableCode, dbo.epa_odm_variables.variableType,
dbo.epa_odm_variables.DataType, dbo.ODM_SITES.SiteID, dbo.ODM_SITES.SiteName, dbo.ODM_SITES.LONGITUDE, dbo.ODM_SITES.LATITUDE,
dbo.ODM_SITES.County, dbo.ODM_SITES.Elevation_M, dbo.ODM_SITES.State, dbo.ODM_SITES.DESCRIPTION_TEXT,
dbo.ODM_SITES.FIPS_STATE_CODE, dbo.ODM_SITES.FIPS_COUNTY_CODE, dbo.ODM_SITES.EPA_StationID, dbo.ODM_SITES.StationType,
dbo.ODM_SITES.EPA_HUC, dbo.ODM_SITES.EPA_OrgID, dbo.ODM_SITES.EPA_OrgName, dbo.ODM_SITES.SiteCode, dbo.ODM_SITES.vertical_datum,
UniqueSeries.ValueCount, MIN(dbo.FA_REGULAR_RESULT.ACTIVITY_START_DATE_TIME) AS BeginDate
FROM dbo.epa_distinctSeries_no_dates AS UniqueSeries LEFT OUTER JOIN
dbo.FA_REGULAR_RESULT ON UniqueSeries.FK_STATION = dbo.FA_REGULAR_RESULT.FK_STATION AND
UniqueSeries.FK_CHAR = dbo.FA_REGULAR_RESULT.FK_CHAR AND
UniqueSeries.FK_ACT_MEDIUM = dbo.FA_REGULAR_RESULT.FK_ACT_MEDIUM LEFT OUTER JOIN
dbo.epa_odm_variables ON UniqueSeries.FK_CHAR = dbo.epa_odm_variables.FK_CHAR AND
UniqueSeries.FK_ACT_MEDIUM = dbo.epa_odm_variables.FK_ACT_MEDIUM LEFT OUTER JOIN
dbo.ODM_SITES ON UniqueSeries.FK_STATION = dbo.ODM_SITES.SiteID
GROUP BY UniqueSeries.FK_STATION, UniqueSeries.FK_CHAR, UniqueSeries.FK_ACT_MEDIUM, dbo.epa_odm_variables.VariableName,
dbo.epa_odm_variables.SampleMedium, dbo.epa_odm_variables.VariableCode, dbo.epa_odm_variables.variableType,
dbo.epa_odm_variables.DataType, dbo.ODM_SITES.SiteID, dbo.ODM_SITES.SiteName, dbo.ODM_SITES.LONGITUDE, dbo.ODM_SITES.LATITUDE,
dbo.ODM_SITES.County, dbo.ODM_SITES.Elevation_M, dbo.ODM_SITES.State, dbo.ODM_SITES.DESCRIPTION_TEXT,
dbo.ODM_SITES.FIPS_STATE_CODE, dbo.ODM_SITES.FIPS_COUNTY_CODE, dbo.ODM_SITES.EPA_StationID, dbo.ODM_SITES.StationType,
dbo.ODM_SITES.EPA_HUC, dbo.ODM_SITES.EPA_OrgID, dbo.ODM_SITES.EPA_OrgName, dbo.ODM_SITES.SiteCode, dbo.ODM_SITES.vertical_datum,
UniqueSeries.ValueCount
HAVING (dbo.ODM_SITES.SiteCode = @sitecode) AND (dbo.epa_odm_variables.VariableCode = @variableCode)