CDA

Community Data Access

Version 1.0, 2010/04/30

Table of Contents

Introduction

What is CDA

CDA (Community Data Access) is a plugin for Pentaho Business Analysis suite, designed to allow greater flexibility for data sources. Most tools available can perform queries over a single data source and there's no easy way to join data coming from different databases or in different languages (joining data from a SQL query with data from a MDX query). These tasks usually require an independent ETL job to synchronize different data sources and build a new database.

CDA was developed as an abstraction tool between database connections and CDF (Community Dashboard Framework). It allows data to be retrieved from multiple datasources and combined in a single output that can easily be passed on to dashboard components.

It serves three main purposes:

  • To allow joining data from different sources;
  • To avoid SQL injection problems within CDF;
  • To allow an easier way to export data from queries.

Concepts

CDA uses two different components: A connection, the database or pentaho datasource to use, and a dataAccess, a query over that connection.

Connections and queries are defined in a XML file (the CDA file) lying in the solution repository. Data access is done by a call to a specific dataAccess id in the CDA file.

Each data access may have parameters and calculated columns added to the results. It's also possible to have compoundDataAccess elements, which represent joins or unions on different queries.

CDA features an API to fetch query results, an editor to edit CDA files and a previewer to visualize the results of a query in table form.

Finally, export: query results can be returned from the API in various formats. Currently supported are JSON, XML, CSV and XLS.

CDA files

A CDA file is a XML file defining the database connections and the queries over them. All it's content must be enclosed in a CDADescriptor block.

<?xml version="1.0" encoding="utf-8"?>
<CDADescriptor>
    (...)
</CDADescriptor>

Inside the main block we define datasources (the database connections), data accesses (the queries themselves) and compound data accesses (joins and unions of queries).

Connections

Connections are defined inside a DataSources block. Inside it each connection is defined by a Connection block which contains all necessary information. For each connection an (unique) id and type must be provided.

CDA supports the following connection types:

  • metadata.metadata
  • sql.jdbc
  • sql.jndi
  • mondrian.jdbc
  • mondrian.jndi
  • olap4j.jdbc
  • scripting.scripting
  • xpath.xPath
  • kettle.TransFromFile

olap4j, scripting, xpath and kettle connections are available only on CDA for Pentaho 3.6.

A connection defines the datasource and the way the connection is established. Each connection type requires a different set of properties to be defined.

metadata.metadata

  • DomainId: the Metadata domain as defined on the Pentaho metadata layer
  • XmiFile: the file containing the metadata definitions, exported from the Metadata repository

sql.jdbc

Simple - Driver: the Java classname of the driver (e.g. com.mysql.jdbc.Driver) - Host: the URL to connect to (e.g. jdbc:mysql://localhost:3306/database) - User: the username to use - Pass: the user's password

sql.jndi

  • Jndi: the connection name as defined on the datasources.xml file

mondrian.jdbc

  • Driver: the Java classname of the driver (e.g. com.mysql.jdbc.Driver)
  • Host: the URL to connect to (e.g. jdbc:mysql://localhost:3306/database)
  • User: the username to use
  • Pass: the user's password
  • Catalog: the path to the mondrian schema file to use

mondrian.jndi

  • Jndi: the connection name as defined on the datasources.xml file
  • Catalog: path to the mondrian schema file

olap4j.jdbc

  • Driver: the olap4j driver to use (typically mondrian.olap4j.MondrianOlap4jDriver)
  • Url: the connection url, typically jdbc:mondrian:
  • JdbcDrivers: the Java classname of driver (e.g. org.hsqldb.jdbcDriver)
  • Jdbc: the database url (e.g. jdbc:hsqldb:hsql://localhost:9001/sampledata)
  • JdbcUser: the username to use
  • JdbcPassword: the user's password
  • Catalog: path to the the mondrian schema file

scripting.scripting

  • Language: the scripting language used; only beanshell is currently supported
  • InitScript: the initialization script to be run before the query

xpath.xpath

  • DataFile: a XML file with the data to query

kettle.TransFromFile

  • KtrFile: the kettle transformation file to use
  • variable: the value of a variable to be used in the transformation; multiple variable properties may be defined, one for each variable to use

Examples

You can find examples of all these types of connection in your solution repository, under bi-developers/cda/cdafiles.

Data accesses

Simple data accesses

A simple data access is a query over a connection. It's defined by a DataAccess block. CDA supports the following types of data accesses:

  • sql
  • mdx
  • denormalizedMdx
  • mql
  • olap4j
  • scriptable
  • xPath
  • kettle

The following attributes must be defined:

  • id: the unique identifier of the data access
  • connection: the id of the connection to use
  • type: either sql, mdx, denormalizedMdx or mql, depending on the query type
  • access: public or private; public data accesses are available from outside calls, while private data accesses can only be called from other data accesses (e.g. compound data accesses)
  • cache: either true or false (optional, defaults to true)
  • cacheDuration: time to keep the results in cache, in seconds (optional, defaults to 3600)

Inside a DataAccess block the query is defined inside a Query block, containing the query code.

The data access may be parametrized using the following blocks:

  • Name: The Name property sets an alias by which a data access can be refered to. If not specified the data access is identified by its id
  • Columns: definition calculateed columns and column names
  • Parameters: lists the parameter names, types and default values to be used by the query
  • Output: indexes of columns to output

Reference information about the different query types

For information about the different query languages see the following reference sites:

Normalized vs. Denormalized MDX

CDA allows two different types of MDX queries: standard (normalized) or denormalized. It's not within the scope of this document to elaborate over the details of data normalization/denormalization, but the following example can shed some light over the differences between the two:

select
    {[Measures].[Sales], [Measures].[Quantity]} ON COLUMNS,
    NON EMPTY  [Time].Children ON ROWS
from [SteelWheelsSales]
where ([Order Status].[${status}])

This query will generate one column for each measure in normalized form and have only one row for each child of the Time dimension. In denormalized form there will only two columns added for measures, one indicating the name of the measure and another for its value, regardless of the number of measures queried. The resulting dataset will have one row for each combination of measure and child of the Time dimension.

In a very loose way one can say that normalized minimizes row redundancy at the expense of the number of columns whereas denormalized form limits the number of columns at the expense of row redundancy.

So, in normalized form, the result of the above query is:

[Time].[(All)] [Time].[Years] [Measures].[Sales] [Measures].[Quantity]
All Years 2004 4750205.889999998 47151
All Years 2005 1513074.4600000002 14607
All Years 2003 3573701.2500000023 35313

In denormalized form, the same query returns:

[Time].[(All)] [Time].[Years] [Measures].[MeasuresLevel] Measure
All Years 2003 Quantity 35313
All Years 2004 Quantity 47151
All Years 2005 Quantity 14607
All Years 2005 Sales 1513074.4600000002
All Years 2004 Sales 4750205.889999998
All Years 2003 Sales 3573701.2500000023

Parameters

Parameters are variables that can be passed on to the query, for example to create dynamic queries depending on user input.

The query

select o.YEAR_ID, o.STATUS, sum(o.TOTALPRICE) as price from orderfact o
where o.STATUS = ${status} and o.ORDERDATE &gt; ${orderDate}
group by o.YEAR_ID, o.STATUS

Uses two parameters, status and orderDate. Parameters are always passed to the query as ${parameterName}.

The definition of parameters is done inside a `Parameters block, where the various parameters are defined. For each parameter the type and default attributes must be specified. A third attribute, pattern must be specified for parameters of the date type.

  • type: Numeric (double), Integer, String or Date;
  • default: the default value if the parameter value isn't specified when the data access is called.
  • pattern: date format. All formats accepted by the SimpleDateFormat Java class are valid.

Example:

<Parameters>
    <Parameter name="status" type="String" default="Shipped"/>
    <Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>
</Parameters>

Columns

The Columns block supports renaming of columns and defining of new columns, using formulas.

Each column may be renamed using a Column block, by specifying its name:

<Column idx="1">
    <Name>Year</Name>
</Column>

Calculated columns are added by a CalculatedColumn block, where the new column's name and the formula to use are defined. Each calculated column requires two properties: Name (the name that will be output by CDA), and Formula (the column's definition itself).

<CalculatedColumn>
    <Name>PriceInM</Name>
    <Formula>=[price]/1000000</Formula>
</CalculatedColumn>

Formulas are written in Open Formula format.

Output

The Output tag define which columns to output and by which order. If not specified all columns from the query will be output in the same order as defined on the query. Output requires only one attribute, the indexes to be output. The order of the output columns is defined by the order in which the indexes are specified.

<Output indexes="1,0,3"/>

Compound data accesses

Compound data accesses combine the results of two queries. They're defined by a CompoundDataAccess block. There are two types of compound data accesses supported: union and join.

Unions take the results of two queries with the same number of columns and returning the rows from both data accesses on top of each other.

Joins take the results of two queries and join them by a specified set of keys. The result has the columns of both queries. The join performed is full outer.

A compound data access requires only two attributes:

  • id: the unique data access identifier
  • type: either join or union

If the two source data accesses have parameters these must also be defined in the compound data access, inside a Parameters block. Compound data accesses may also have Name, Columns and Output blocks.

Join

Join compound data accesses require that both the Left and Right data accesses be specified (by id) as well as the keys (column ids on the source queries) by which to join.

<CompoundDataAccess id="3" type="join">
    <Left id="1" keys="0,1"/>
    <Right id="2" keys="0,1"/>
    <Parameters>
        <Parameter name="status" type="String" default="Shipped"/>
        <Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>
    </Parameters>
    <Output indexes="0,1,2,5,6"/>
</CompoundDataAccess>

Union

Union compound data accesses require the definition of Top and Bottom queries (by id).

<CompoundDataAccess id="3" type="union">
    <Top id="2"/>
    <Bottom id="1"/>
    <Parameters>
        <Parameter name="year" type="Numeric" default="2004"/>
    </Parameters>
</CompoundDataAccess>

Web API reference

CDA is called by http requests to BASE_URL/pentaho/content/cda/.

The following methods are currently supported:

CDA file

Some of the methods listed above require that the CDA file be specified on the URL. This can be done in two different ways:

  • providing the solution folder, the path to the file from the solution and the name of the CDA file.
  • providing the full path from the solution repository.

Example:

solution=bi-developers&path=cda/cdafiles&file=compoundUnion.cda

is the same as

path=bi-developers/cda/cdafiles/compoundUnion.cda

Output

All methods listed above (except doQuery and clearCache) output the results as a JSON formatted object, with two key values: resultset with the data itself and metadata with info about column indexes, names and data types.

Example of a listQueries output:

{"resultset":[
        ["3","","union"],
        ["2","Sql Query on SampleData","sql"]],
 "metadata":[
        {"colIndex":0,"colType":"String","colName":"id"},
        {"colIndex":1,"colType":"String","colName":"name"},
        {"colIndex":2,"colType":"String","colName":"type"}]}

clearCache returned the string "Cache cleared".

doQuery returns the query results in one of four formats: JSON, XML, XLS or CSV. The default is JSON.

Method reference

doQuery

doQuery establishes the necessary connections and performes the query specified by the dataAccessId parameter.

You may specify the values of the different parameters used by the query (if not specified default values defined on the CDA file will be used). Parameters are passed on the URL as paramParameter, where Parameter is the parameter's name.

Example:: if the data access with id=1 of the file $SOLUTION_REPOSITORY/solution/cda/file.cda has a parameter called Status and you want the query to use Shipped as the parameter the URL to use is

$BASE_URL/pentaho/content/cda/doQuery?path=solution/cda/file.cda&dataAccessId=1&paramStatus=Shipped

doQuery has an optional argument, outputType where the format of the returned data is specified as either `json, xml, csv or xls. The default value is json.

listQueries

This method returns a JSON formatted list of all public queries in the specified CDA file.

getCdaList

Returns a JSON formatted list of all CDA files in the solution repository. Doesn't take any arguments.

listParameters

Returns a list of parameters used by a data access. Requires both the path and dataAccessId to be specified.

clearCache

Clears the contents of the cache. Takes no arguments.

listDataAccessTypes

Returns a JSON formatted list with all supported data access types and connections and their definitions.

CDA previewer

The CDA previewer is an application that allows CDA queries to be returned as HTML and displayed on table form. Parameters can be changed using input boxes and a dropdown menu allows changing the dataAccessId.

The previewer is called from the URL

Example:

$BASE_URL/pentaho/content/cda/previewQuery?path=solution/cda/file.cda

specifying the filename in the path argument.

CDA Editor

An embedded editor allows editing CDA files under the scope of Pentaho server, avoiding the need to use external editors and manually refreshing the solution repository. This only works on some browsers, like firefox or chrome.

The editor is called from the URL

Example:

$BASE_URL/pentaho/content/cda/editFile?path=solution/cda/file.cda

specifying the filename in the path argument. The editor does not allow creating new CDA files. You must manually create a blank CDA file on the solution repository and refresh the solution repository to be able to edit the file.

Examples

See the solution repository of your Pentaho installation for sample CDA files.

Further information

For more information about CDA, contact us at imagine@webdetails.pt or visit our website