CUAHIS-HIS
RSS

Navigation





Quick Search
»
Advanced Search »

PoweredBy

ODM Performance Testing

RSS
Modified on 2009/03/25 10:09 by valentinedwv Categorized as Uncategorized
This has two parts, the first a discussion from an evaluation of ODM from the Austrialan Bureau of Meteorology. The second is suggestions on how to do a performance evaluation on ODM.

AU BOM

The Cuashi HIS Data Loading document discusses loading 167 million records in ODM. Discusses performance issues, bottle necks, possible solutions to the bottlenecks, and implications to the solutions. They proposed the solution Data Loading Proposal

Highlights

Slow Data Loading on large DataValues table

that there was a progressive slow-down in processing speed, from 5,700 records / second in a small trial, slowing to 2,700 records / second after 33 million inserts, then on to a paltry 900 records / second over the last 50 million records to make up the ~83 million of post-2000 data values.

Long Custered index regeneration

Clustered indexes are used on the data values table.
Re-build clustered index 3:20 minutes6 million5,000
measured <10 hours (estimate 9.5)167 million 4,700

Suggested Methods to Test ODM

NOTES on ODM Schema samples and Performance Testing (by DaveV):

ODM schema samples and validation cases

Small example sets with predicable results Number of sites: Number of variables: Number of series per site, or per siteCode counts Number of data values per series, or per series counts

  • Examples E.1, E.2, E.3, E.4, E.5
    • Small example sets, two to 4 stations, one or two variables, <1000 values with predicable results
  • Site with no series or data values
    • What behavior do we expect to see
  • Variable differentiation
    • We need to document which variables are similar so that users can identify similar variables across data sources, even though the variableCodes are different. We also have examples where VariableCode is often the same across a set of data (eg USGS), and other columns need to be used to differentiation them (the variable vocabulary the may be different).
    • Provide at least 4 example sets of similar variables, where a set is a combinations of units, datatype, valuetype, time support, nodatavalue, sample medium, speciation, quality control level and other characteristics that represent examples where variable retrieval needs to be differentiated.
    • Sets need to help document What is the expected matching behavior for ‘unknown’
    • Within a set, provide instructions on which variables rows are the same, and which are different.
    • For the future services design specification and metadata catalog, which ones should be retrieved as the equivalent.
  • Time Support and Data Type
    • Additional examples for combinations discussed in ODM design specification
  • Examples that utilize the no data value
  • Example that demonstrated, sample identifier, sample
    • One site, one sample, one data value
    • One site, one sample, multiple data values derived from sample, and associated series records
  • Derived data sample
    • One site, 1 variable, 1 derived variable
    • Three sites, n <3 variables, derived variables
  • Multiple instruments measuring same at same site
  • Other examples as determined by use cases
  • Bad Data (make sure these are not possible)
    • Data values with no site and/or variable
    • Data values associated with multiple series
  • Dynamic test cases.
    • Start with existing DB, load additional information, check that information is in DB

Client output

  • Examples of properly using Time Support to output precision of information
    • How to handle min max daily statistical observations where the time is important
    • Excel output
      • If time support is day, output no time
      • If time support is hour, , output precision to hour
      • If time support is month, output to precision to min
  • Example on how to properly handle no data value

ODM performance testing

What is the metric? Database size, data load performance, data retrieval performance?

Suggested scenarios

  • Y Sites, N Series, O data values
  • 1 site, 1 series, YxNxO data values
  • Third and foruth intermediate scenarios

Single Stream Performance:
  • Data load performance for various data spacing.
    • load data, for above scenarios
      • 30 seconds
      • 10 seconds
      • 5 seconds
      • 2 seconds

Performance for loading multiple streams
  • Data load performance for various data spacing.
    • load data, for above scenarios
      • 30 seconds
      • 10 seconds
      • 5 seconds
      • 2 seconds

Data retrieval performance
  • Retrieve data from above scenarios
  • retrieve data while loading data

ODM Performance Testing Observations over EPA Catalog

The EPA Catalog data was used to generate a series catalog from the data.

ODM SeriesCatalog Query

SELECT SiteID, VariableID, MethodID, QualityControlLevelID, SourceID, MIN(LocalDateTime) AS BeginDateTime, MAX(LocalDateTime) AS EndDateTime, MIN(DateTimeUTC) AS BeginDateTimeUTC, MAX(DateTimeUTC) AS EndDateTimeUTC, COUNT(DataValue) AS ValueCount FROM DataValues GROUP BY SiteID, VariableID, MethodID, QualityControlLevelID, SourceID) dv INNER JOIN dbo.Sites s ON dv.SiteID = s.SiteID INNER JOIN dbo.Variables v ON dv.VariableID = v.VariableID INNER JOIN dbo.Units u ON v.VariableUnitsID = u.UnitsID INNER JOIN dbo.Methods m ON dv.MethodID = m.MethodID INNER JOIN dbo.Units u1 ON v.TimeUnitsID = u1.UnitsID INNER JOIN dbo.Sources so ON dv.SourceID = so.SourceID INNER JOIN dbo.QualityControlLevels qc ON dv.QualityControlLevelID = qc.QualityControlLevelID GROUP BY dv.SiteID, s.SiteCode, s.SiteName, dv.VariableID, v.VariableCode, v.VariableName, v.Speciation, v.VariableUnitsID, u.UnitsName, v.SampleMedium, v.ValueType, v.TimeSupport, v.TimeUnitsID, u1.UnitsName, v.DataType, v.GeneralCategory, dv.MethodID, m.MethodDescription, dv.SourceID, so.Organization, so.SourceDescription, so.Citation, dv.QualityControlLevelID, qc.QualityControlLevelCode, dv.BeginDateTime, dv.EndDateTime, dv.BeginDateTimeUTC, dv.EndDateTimeUTC, dv.ValueCount ORDER BY dv.SiteID, dv.VariableID, v.VariableUnitsID



ODM SeriesCatalog generation on EPA Storet

EPA Storet Regular Results Table
  • 57 million rows
  • 3.78 million "series"
  • Query similar to the standard ODM series catalog took 15 hours to generate 105,364 rows for a series
    • need to table scan, since the min/max functions are in there

Original query SELECT TOP (100) PERCENT FK_STATION, FK_CHAR, FK_ACT_MEDIUM, COUNT_BIG(PK_ISN) AS ValueCount, MIN(ACTIVITY_START_DATE_TIME) AS BeginDateTime, MAX(ACTIVITY_START_DATE_TIME) AS EndDateTime FROM dbo.FA_REGULAR_RESULT GROUP BY FK_CHAR, FK_ACT_MEDIUM, FK_STATION HAVING (FK_CHAR IS NOT NULL) AND (FK_STATION IS NOT NULL)



New query First pass: SELECT TOP (100) PERCENT FK_STATION, FK_CHAR, FK_ACT_MEDIUM, COUNT_BIG(PK_ISN) AS ValueCount FROM dbo.FA_REGULAR_RESULT GROUP BY FK_CHAR, FK_ACT_MEDIUM, FK_STATION HAVING (FK_CHAR IS NOT NULL) AND (FK_STATION IS NOT NULL)

Then have a view that will SELECT epa_distinctView_no_dates.FK_STATION, epa_distinctView_no_dates.FK_CHAR, epa_distinctView_no_dates.FK_ACT_MEDIUM, epa_distinctView_no_dates.ValueCount, MIN(FA_REGULAR_RESULT.ACTIVITY_START_DATE_TIME) AS Expr1 FROM epa_distinctView_no_dates LEFT OUTER JOIN FA_REGULAR_RESULT ON epa_distinctView_no_dates.FK_STATION = FA_REGULAR_RESULT.FK_STATION AND epa_distinctView_no_dates.FK_CHAR = FA_REGULAR_RESULT.FK_CHAR AND epa_distinctView_no_dates.FK_ACT_MEDIUM = FA_REGULAR_RESULT.FK_ACT_MEDIUM

where epa_distinctView_no_dates.FK_STATION = 108167 and epa_distinctView_no_dates.FK_CHAR= 1 and epa_distinctView_no_dates.FK_ACT_MEDIUM = 1

GROUP BY epa_distinctView_no_dates.FK_STATION, epa_distinctView_no_dates.FK_CHAR, epa_distinctView_no_dates.FK_ACT_MEDIUM, epa_distinctView_no_dates.ValueCount


Adding min/max just kill the query speed

SELECT UniqueSeries.FK_STATION, UniqueSeries.FK_CHAR, UniqueSeries.FK_ACT_MEDIUM, dbo.epa_odm_variables.VariableName, dbo.epa_odm_variables.SampleMedium, dbo.epa_odm_variables.VariableCode, dbo.epa_odm_variables.variableType, dbo.epa_odm_variables.DataType, dbo.ODM_SITES.SiteID, dbo.ODM_SITES.SiteName, dbo.ODM_SITES.LONGITUDE, dbo.ODM_SITES.LATITUDE, dbo.ODM_SITES.County, dbo.ODM_SITES.Elevation_M, dbo.ODM_SITES.State, dbo.ODM_SITES.DESCRIPTION_TEXT, dbo.ODM_SITES.FIPS_STATE_CODE, dbo.ODM_SITES.FIPS_COUNTY_CODE, dbo.ODM_SITES.EPA_StationID, dbo.ODM_SITES.StationType, dbo.ODM_SITES.EPA_HUC, dbo.ODM_SITES.EPA_OrgID, dbo.ODM_SITES.EPA_OrgName, dbo.ODM_SITES.SiteCode, dbo.ODM_SITES.vertical_datum, UniqueSeries.ValueCount, MIN(dbo.FA_REGULAR_RESULT.ACTIVITY_START_DATE_TIME) AS BeginDate FROM dbo.epa_distinctSeries_no_dates AS UniqueSeries LEFT OUTER JOIN dbo.FA_REGULAR_RESULT ON UniqueSeries.FK_STATION = dbo.FA_REGULAR_RESULT.FK_STATION AND UniqueSeries.FK_CHAR = dbo.FA_REGULAR_RESULT.FK_CHAR AND UniqueSeries.FK_ACT_MEDIUM = dbo.FA_REGULAR_RESULT.FK_ACT_MEDIUM LEFT OUTER JOIN dbo.epa_odm_variables ON UniqueSeries.FK_CHAR = dbo.epa_odm_variables.FK_CHAR AND UniqueSeries.FK_ACT_MEDIUM = dbo.epa_odm_variables.FK_ACT_MEDIUM LEFT OUTER JOIN dbo.ODM_SITES ON UniqueSeries.FK_STATION = dbo.ODM_SITES.SiteID GROUP BY UniqueSeries.FK_STATION, UniqueSeries.FK_CHAR, UniqueSeries.FK_ACT_MEDIUM, dbo.epa_odm_variables.VariableName, dbo.epa_odm_variables.SampleMedium, dbo.epa_odm_variables.VariableCode, dbo.epa_odm_variables.variableType, dbo.epa_odm_variables.DataType, dbo.ODM_SITES.SiteID, dbo.ODM_SITES.SiteName, dbo.ODM_SITES.LONGITUDE, dbo.ODM_SITES.LATITUDE, dbo.ODM_SITES.County, dbo.ODM_SITES.Elevation_M, dbo.ODM_SITES.State, dbo.ODM_SITES.DESCRIPTION_TEXT, dbo.ODM_SITES.FIPS_STATE_CODE, dbo.ODM_SITES.FIPS_COUNTY_CODE, dbo.ODM_SITES.EPA_StationID, dbo.ODM_SITES.StationType, dbo.ODM_SITES.EPA_HUC, dbo.ODM_SITES.EPA_OrgID, dbo.ODM_SITES.EPA_OrgName, dbo.ODM_SITES.SiteCode, dbo.ODM_SITES.vertical_datum, UniqueSeries.ValueCount HAVING (dbo.ODM_SITES.SiteCode = @sitecode) AND (dbo.epa_odm_variables.VariableCode = @variableCode)

ScrewTurn Wiki version 3.0.1.400. Some of the icons created by FamFamFam.