Community Data Generator

Community Data Generator (CDG) is a Data Warehouse generator and the youngest member of the CTools family.

Given the definition of the dimensions that we want, CDG will randomize data within specific parameters, and output 3 different things:

  • Database and table for the fact table;
  • A file with inserts for the fact table;
  • Mondrian schema file to be used within Pentaho.

While most of the documentation mentions the usage within the scope of Pentaho, there's absolutely nothing that prevents the resulting database to be used in different contexts.

Credits: Webdetails Team, Lead - Pedro Alves

Documentation

Version 1 of CDG is called by editing a kettle transformation. In the source you'll find a src directory, but this one is our "development area". The code you need to use is inside kettle.

There are only 2 things that need to be changed:

  • In Transformation Properties you can define the name of the database and fact table;
  • In Dimension Info you configure the transformation parameters.

Change the parameters you want (or just run with the default), run the transformation, and 3 files will appear in the output directory: database and table ddl file, insert scripts and the Mondrian schema file.

Setting up totals

Inside the Dimension Info step you'll find a mention to the total:

/* SET THE APPROXIMATE TOTAL FOR THE VALUES */

var total = 5000;

Setting up dimensions

This is done within the same file you setup the dimensions. You can have as much as you want, but you must pay attention to the fact that if you use lots of dimensions / high cardinality we can quickly end up with a huge database.

While there's nothing particularly wrong with that, it's up to you doing specific optimizations like indexes or even aggregate tables. That's outside the scope of CDG.

Here's a sample dimension definition:

var countries = [
  {countryName: "Italianos", proportion: 30}, 
  {countryName: "Portugueses", proportion: 18}, 
  {countryName: "Alemães", proportion: 12}, 
  {countryName: "Espanhóis", proportion: 10}, 
  {countryName: "Japoneses", proportion: 15}, 
  {countryName: "Coreanos", proportion: 8}, 
  {countryName: "Chineses", proportion: 3}, 
  {countryName: "Outros", proportion: 5} 
];

The sample provided in CDG is written in Portuguese, in order to specifically test character encoding support. The generated files are encoded in UTF-8 and we always recommend using this standard in databases too.

By defining this object, CDG will create a dimension with 8 members and one level called countryName . You can have other properties in there and CDG will create a Mondrian schema with different levels. The provided example has only one.

There's a special property in there called proportion - it will be used by CDG to do the breakdown of the total. In the example, roughly 30% of the total will be assigned to Italians and so on. In every aspects of the code there's a random factor in place.

Configuring Date Dimension

The date dimension is always a specific case, since it usually acts as a snapshot dimension.

Since configuring all possible members of this dimension would take too much time, we provide an utility function that generates all the dates between 2000 and 2012 down to the month.

This is standard JavaScript, so feel free to change this function in order to change the date range, month names or even adding the day level (be aware that adding the day level will substantially increase the number of values in the fact table).

/* CONFIGURE THE DATE DIMENSION. */

var dateDim = [];

var months = [
  [1,"Jan","Janeiro"], [2,"Fev","Fevereiro"], [3,"Mar","Março"], [4,"Abr","Abril"],
  [5,"Mai","Maio"], [6,"Jun","Junho"], [7,"Jul","Julho"], [8,"Ago","Agosto"],
  [9,"Set","Setembro"], [10,"Out","Outubro"], [11,"Nov","Novembro"], 
  [12,"Dec","Dezembro"]
];

range(2000,2012).map(function(year){
  range(0,12).map(function(month){
    var m = months[month];
    dateDim.push(
      {"year":year ,"monthNo": m[0], "monthAbbrev":m[1], "monthDesc": m[2]}
    );
  });
});
;

Final Configuration

In the end of the script there's the final configuration that will be used by CDG:

 /* MAKE THE FINAL CONFIGURATION. DIMENSIONS CAN EITHER BE SNAPSHOT OR REGULAR BREAKDOWNS */

var outputArray = [

  {name: "Date", dimension: dateDim, toBreakdown: false, increment: 0.05 }, 
  {name: "Provices", dimension: provinces, toBreakdown: true },
  {name: "Countries", dimension: countries, toBreakdown: true },
  {name: "Dates", dimension: gender, toBreakdown: true } 

]

There we define the names and types of the dimension. The property toBreakdown should be true for normal dimensions and false for snapshot dimensions.

If it's a snapshot dimension, you need to specify the increment property. The value 0.05 basically means that we'll have roughly 5% increase each month.

After that, you can run the transformation.

Output

After running the transformation, we get this output:

  • A .ddl file to create the database and the table;
  • A file with SQL inserts to populate the Data Warehouse;
  • A Mondrian schema file to use within Mondrian or to import to Pentaho.
cdg/kettle/output
├── cdgsample.ddl
├── cdgsample.mondrian.xml
└── cdgsample.sql

Result

The result, after declaring this new datasource and registering the cube in Mondrian, is a new cube free to use.

FAQ

What does CDG mean?

CDG stands for Community Data Generator.

Is CDG free?

CDG is released under the MPLv2 license.

Recommend this article:

Registered in Portugal. NIF 508 528 283