Community Data Validation

Community Data Validation (CDV) brings the ability to create validation tests on the Pentaho Business Analytics server, in order to verify both the integrity of the server itself and the data being used by the server.

Credits: Webdetails Team, Lead - Pedro Alves

Motivation

Very often, we want several questions about the data to be answered before anyone else notices a problem:

  • Do we have up-to-date data?
  • Can we trust the data?
  • How long did the queries take to run?
  • Do we have wrong data? (e.g.: duplicated users in a community)
  • We need to be able to validate big chunks of reprocessing.
  • Did the data format change?
  • Is the data stalled? (e.g.: number of Twitter followers not updating)
  • Do we have peaks or valleys in the data? (due to double processing or no processing)
  • We need a way to handle known effects. (e.g.: the Christmas dip)
  • We need to correlate independent data sources.
  • Do we have a big number of 'unknowns'? (tk=1 in the DW)
  • Do we have clearly wrong rows in the resultset? (e.g.: a line there)

So we decided to build CDV, a data validator that periodically runs a set of tests answering the questions above.

You can see the original RFC for CDV. Note that not every feature has been implemented.

Wondering who can provide you a fitting dashboard design? Consider Webdetails.

Documentation

Main Features

See existing validations

  • Fire a specific validation
  • Get the URL of a specific validation / all validations

CDA Query error dashboard

  • Query and parameters
  • Errors
  • Incidents

Create / Edit validation

  • Define query name
  • Define queries and parameters
  • Define validation functions
  • Choose log alerts (when to throw error / severe / warn / ok)
  • Define error messages
  • Choose duration thresholds
  • Define cron schedules

Validation status dashboard

Duration Dashboard to identify slow points in the system

  • Query and parameters
  • Duration
  • Incidents

Check our demo. We can build a matching solution for your scenario too. Ask us for a quote.

Creating Validations

All the validations created will be stored in solution/cdv/tests/.

The files will have the format Name.cdv and will internally hold a cdv.registerTest() call with the following JSON-like structure:

cdv.registerTest({
  id: 1,                       // each test must have a unique id
  type: "query",
  name: 'test',
  group: "",                   // used to group tests in the status dashboard
  path: 'cdv/test/test.cdv',
  createdBy: '',
  createdAt: 1339430893246,    // creation timestamp, in ms
  queries: [
    {
      cdaFile: "",             // CDA file used as datasource
      dataAccessId: "",
      parameters: {
      }
    }
  ],
  validations: [{
    validationName: "",
    validationType: "",
    validationFunction: function(rs, conf) {
    }
  }],
  executionTimeValidation: {
    expected: 100,             // expected query duration, in ms
    warnPercentage: 0.30,
    errorPercentage: 0.70,
    errorOnLow: true
  },
  cron: "0 0 10 * ? *"         // cron-style schedule
});

Each test must have a unique id.

You can use different datasources by pointing each query at a different CDA file, defining the queries and validations with the following structure:

queries: [
  { cdaFile: "/solution/cda/test.cda", dataAccessId: "1", parameters: [...] },
  { cdaFile: "/solution/cda/test2.cda", dataAccessId: "2", parameters: [...] }
],
validations: [{
  validationType: "custom",
  validationFunction: function(rs, conf) {
    // Depending on conf.testAll, require that every resultset (or at least
    // one of them) has rows, and raise an alarm when that is the case.
    var exists = rs.map(function(r){ return r.length > 0; }).reduce(function(prev, curr){
      return conf.testAll ? (curr && prev) : (curr || prev);
    });
    return exists ? Alarm.ERROR : Alarm.OK;
  }
}],

We'll be using the Steel-Wheels Sample Data solution to create some examples of the validations that can be done.

As an example, the following MDX query returns the Quantity and Sales measures for a specific year, broken down by quarter and month:

select NON EMPTY {[Measures].[Quantity], [Measures].[Sales]} ON COLUMNS,
  {Descendants(
    [Time].[${yearParameter}],
    [Time].[${yearParameter}], AFTER
  )} ON ROWS
from [SteelWheelsSales]

For example, for 2003 the resultset is:

["QTR1", 4561, 445094.69], ["Jan", 1357, 129753.6], ["Feb", 1449, 140836.19000000006], ["Mar", 1755, 174504.89999999997],
["QTR2", 5695, 564842.02], ["Apr", 1993, 201609.55], ["May", 2017, 192673.11000000002], ["Jun", 1685, 170559.36000000004],
["QTR3", 6629, 687268.8699999998], ["Jul", 2145, 225486.21000000002], ["Aug", 1974, 197809.30000000002], ["Sep", 2510, 263973.36],
["QTR4", 19554, 1980178.4199999995], ["Oct", 5731, 589963.9], ["Nov", 10862, 1086720.4000000001], ["Dec", 2961, 303494.11999999994]

We can test whether the resultset has data for all months and quarters, as expected, and test the variations between months and quarters to detect peaks or valleys in the data caused by double processing or no processing.

To this end, here is an example test that uses a CDA file containing the MDX query shown above:

cdv.registerTest({
  id: 99999,
  type: "query",
  name: 'Existence of data',
  group: "Steel-Wheels",
  path: 'cdv/tests/steelwheels-existence.cdv',
  createdBy: 'Webdetails',
  createdAt: 1339430893246,
  queries: [
  {
    cdaFile: "/plugin-samples/cdv/steelwheels-tests.cda",
    dataAccessId: "monthlyQuery",
    parameters: {
      yearParameter: "2003"
    }
  }
  ],
  validations: [{
    validationName: "Steel-Wheels Data Validation",
    validationType: "custom",
    validationFunction: function(rs, conf) {
      var success = true,
          dif1 = [], dif2 = [];  // used by the variation checks discussed below
      // Test the existence of data: a full year has 4 quarters + 12 months = 16 rows
      var i = rs[0].resultset.length;
      if (i < 16) {
        return {type: "ERROR", description: "Missing data in Steel-Wheels!"};
      }
      return success ? "OK" : {type: "ERROR", description: "Missing data in Steel-Wheels!"};
    }
  }],
  executionTimeValidation: {
    expected: 100,
    warnPercentage: 0.30,
    errorPercentage: 0.70,
    errorOnLow: false
  },
  cron: "0 0 10 * ? *"
});

Since the resultset is supposed to cover all quarters and months of a year, we expect it to have 16 rows, so we can use that count to check whether we have all the data for the year we want. As an alternative, we can also check whether there are actual values in the resultset.
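
A minimal sketch of that alternative, assuming the rs structure used above (rows of [label, quantity, sales]; the column indexes are illustrative):

validationFunction: function(rs, conf) {
  var rows = rs[0].resultset;
  for (var i = 0; i < rows.length; i++) {
    // columns 1 and 2 hold Quantity and Sales in this example
    if (rows[i][1] == null || rows[i][2] == null) {
      return {type: "ERROR", description: "Empty values in Steel-Wheels row " + i};
    }
  }
  return "OK";
}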

If the test fails, the message defined in the return statement is shown:

Steel-Wheels: Existence of data and variation - 1 ERROR [Missing data in Steel-Wheels!]

If the previous test passes, the variation tests run next. The example uses a trigger of lower than 10% variation per quarter and greater than 200% variation per month, but you're free to change these thresholds and watch the results.
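
As a rough sketch of what such variation checks could look like (not the exact code shipped with the sample; the row positions and the Sales column index 2 are assumptions based on the resultset above):

validationFunction: function(rs, conf) {
  var rows = rs[0].resultset,
      quarterIdx = [0, 4, 8, 12];  // quarter rows in this resultset
  // Quarter-to-quarter: alert when a quarter drops below 10% of the previous one.
  for (var q = 1; q < quarterIdx.length; q++) {
    if (rows[quarterIdx[q]][2] < 0.1 * rows[quarterIdx[q - 1]][2]) {
      return {type: "ERROR", description: "Suspicious drop at " + rows[quarterIdx[q]][0]};
    }
  }
  // Month-to-month: alert when a month grows beyond 200% of the previous one.
  for (var m = 1; m < rows.length; m++) {
    if (quarterIdx.indexOf(m) >= 0 || quarterIdx.indexOf(m - 1) >= 0) continue;
    if (rows[m][2] > 2 * rows[m - 1][2]) {
      return {type: "ERROR", description: "Suspicious peak at " + rows[m][0]};
    }
  }
  return "OK";
}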

At the bottom, in executionTimeValidation, you can set the expected time that a query should take to run, and set the warnPercentage and errorPercentage margins to receive alerts when a query takes too long. If the query runs suspiciously fast, you will also receive an alert, provided errorOnLow is set to true.
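
With the settings used above, and assuming the percentages are applied relative to the expected time, the thresholds would work out as follows:

executionTimeValidation: {
  expected: 100,          // expected duration, in ms
  warnPercentage: 0.30,   // WARN when the query takes more than 100 * 1.30 = 130 ms
  errorPercentage: 0.70,  // ERROR when it takes more than 100 * 1.70 = 170 ms
  errorOnLow: true        // with this set, a suspiciously fast run also raises an alert
}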

Finally, you can schedule a time for the test to run automatically on the cron line, using the predefined cron scheduling definitions.
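
The expression used in the examples, 0 0 10 * ? *, is presumably meant to fire every day at 10:00. Assuming the standard Quartz field order (seconds, minutes, hours, day-of-month, month, day-of-week), a few illustrative schedules would be:

cron: "0 0 10 * * ?"    // every day at 10:00:00
cron: "0 */30 * * * ?"  // every 30 minutes
cron: "0 0 6 ? * MON"   // every Monday at 06:00:00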

If we were using another CDA file, or had another dataAccessId returning a second resultset, it could be accessed as rs[1].resultset, where the index 1 matches the order in which the queries are defined in the queries section, from top to bottom.
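
For instance, a validation that correlates two such resultsets (one of the motivations listed above) could be sketched like this; the indexes and the 1% tolerance are purely illustrative:

validationFunction: function(rs, conf) {
  // Compare the same total coming from two independent queries.
  var totalA = rs[0].resultset[0][1],  // first query, first row, second column
      totalB = rs[1].resultset[0][1];  // second query, same position
  // Tolerate up to 1% of relative difference between the two sources.
  if (Math.abs(totalA - totalB) > 0.01 * Math.abs(totalA)) {
    return {type: "ERROR", description: "Sources disagree: " + totalA + " vs " + totalB};
  }
  return "OK";
}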

The tests are sorted by the groups defined when each test is created. Within each group, each line corresponds to a test, showing its name, the path to the CDA file used as datasource, the validation name, the expected duration of the query, the status of the test, and an Options button.

If a test returned WARN the last time it ran, its font color changes to orange; in case of an ERROR, it changes to red.

Invocation and Scheduling

There are two ways to call the validations:

  • By URL Request;
  • Scheduled Calls.

The URL is based on the test ID / query name (tbd). The scheduled calls are cron based.

In the Alerts area, you can see the test runs, sorted by the time they ran, and filter them by test status.

The CDA Errors area lists every error found when running the CDA queries.

The Slow Queries area shows the queries that took longer to execute than the expected time set in the CDV file for the test.

External Interfaces Supported

  • E-mail;
  • HTTP;
  • Nagios integration;
  • Server up check.

The last one is a very specific check: all the other integrations will fail to notify us if the server suddenly hangs, so we need a separate way to be told about that. Through the HTTP and Nagios integrations, we'll be able to get reports not only on individual tests but also on test groups. This does not rerun the tests; it reports the last known status of a test.

In the HTTP case, we can pass a flag to force a test to be rerun.

For Nagios, we can have an export of test rules.


FAQ

What does CDV mean?

CDV stands for Community Data Validation.

Is CDV free?

CDV is licensed under the MPLv2 license.
