Community Data Access

Community Data Access (CDA) is a Pentaho plugin designed for accessing data with great flexibility.

Born for overcoming some cons of the older implementation, CDA allows you to access any of the various Pentaho data sources and:

  • join different data sources just by editing an XML file;
  • cache queries providing a great boost in performance;
  • deliver data in different formats (csv, xls, etc.) through the Pentaho User Console;
  • sort and paginate data server-side.

Without worrying about the details CDA can be used as a standalone plugin on the Pentaho BI server or in combination with CDE / CDF.

Credits: Webdetails Team, Lead - Pedro Alves; Thomas Morgner; Daniel Einspanjer; Will Gorman; Sergio Ramazzina; Aaron Phillips; Pentaho Community

Documentation

Community Data Access (CDA) was designed to allow greater flexibility for datasources.

Most tools available can perform queries over a single datasource and there's no easy way to join data coming from different databases or in different languages (joining data from a SQL query with data from a MDX query). These tasks usually require an independent ETL job to synchronize different data sources and build a new database. CDA was developed as an abstraction tool between database connections and CDF (Community Dashboard Framework).

It allows data to be retrieved from multiple data sources and combined in a single output that can easily be passed on to dashboard components. It serves three main purposes:

  • To allow joining data from different sources;
  • To avoid SQL injection problems within CDF;
  • To allow an easier way to export data from queries.

Wondering who can provide you a fitting dashboard design? Consider Webdetails.

CDA uses two different components: the connection (the database or Pentaho datasource to use) and the dataAccess (a query over that connection).

Connections and queries are defined in a XML file (the CDA file) lying in the solution repository. Data access is done by a call to a specific dataAccess id in the CDA file.

Each data access may have parameters and calculated columns added to the results. It's also possible to have compoundDataAccess elements, which represent joins or unions on different queries.

CDA features an API to fetch query results, an editor to edit CDA files and a previewer to visualize the results of a query in table form.

Finally, export: query results can be returned from the API in various formats. Currently supported formats are JSON, XML, CSV, XLS and HTML visual mode.

A CDA file is a XML file defining the database connections and the queries over them. All its content must be enclosed in a CDADescriptor block.

<?xml version="1.0" encoding="utf-8"?>
	<CDADescriptor>
	(...)
</CDADescriptor>

Inside the main block we define datasources (the database connections), data accesses (the queries themselves) and compound data accesses (joins and unions of queries).

The CDA file can now also be written from inside the CDE just by defining all the datasources , data accesses and compound data accesses , which is simpler than writing all the code.

Connections are defined inside a DataSources block. Each connection is defined by a Connection block which contains all the necessary information. For each connection an (unique) id and type must be provided.

Supported Connection Types

  • metadata.metadata;
  • sql.jdbc;
  • sql.jndi;
  • mondrian.jdbc;
  • mondrian.jndi;
  • olap4j.jdbc;
  • olap4j.jndi;
  • scripting.scripting;
  • kettle.TransFromFile;
  • xpath.xPath.

olap4j , scripting , xpath and kettle connections are available only on CDA for Pentaho 3.6 onwards.

A connection defines the datasource and the way the connection is established.

Each connection type requires a different set of properties to be defined.

metadata.metadata

  • Name: the name of the connection.
  • XmiFile: the file containing the metadata definitions, exported from the Metadata repository.
  • DomainId: the Metadata domain as defined on the Pentaho metadata layer.

sql.jdbc

  • Name: the name of the connection.
  • Driver: the Java classname of the driver (e.g. com.mysql.jdbc.Driver ).
  • URL: the URL to connect to (e.g. jdbc:mysql://localhost:3306/database ).
  • User Name: the username to use
  • Password: the user's password

sql.jndi

  • Name : the name of the connection.
  • Jndi : the connection's parameters as defined in the context.xml file.

mondrian.jdbc

  • Name: the name of the connection.
  • Mondrian schema: mondrian schema file.
  • Driver: the Java classname of the driver (e.g. com.mysql.jdbc.Driver ).
  • URL: the URL to connect to (e.g. jdbc:mysql://localhost:3306/database ).
  • User Name: the username to use.
  • Password: the user's password.
  • Banded Mode: compact or classic.

mondrian.jndi

  • Name: the name of the connection.
  • Mondrian schema: mondrian schema file
  • Jndi: the connection's parameters as defined in the context.xml file.
  • Banded Mode: compact or classic. There is also the possibility of a denormalized mondrian.jdbc and a denormalized mondrian.jndi MDX queries that have the same options as the standard (normalized) queries. The differences between denormalized and standard (normalized) queries are explained on data accesses. The differences between compact and classic on Banded Mode are also explained on data accesses.

olap4j.jdbc

  • Name: the name of the connection.
  • Driver: the olap4j driver to use (typically mondrian.olap4j.MondrianOlap4jDriver ).
  • Url: the connection url, typically jdbc:mondrian: .
  • User name for the underlying database: the username to use.
  • Password for the underlying database: the user's password.
  • Driver for the underlying database: the Java classname of driver (e.g. org.hsqldb.jdbcDriver).
  • URL pointing at the underlying database: the database url (e.g. jdbc:hsqldb:hsql://localhost:9001/sampledata ).
  • Catalog to apply to the underlying database: path to the the Mondrian schema file (e.g. solution:/steel-wheels/analysis/steelwheels.mondrian.xml ).

olap4j.jndi

  • Name: the name of the connection.
  • Jndi: the connection's parameters as defined in the context.xml file There is also the possibility of a denormalized olap4j.jdbc and a denormalized olap4j.jndi MDX queries that have the same options as the standard (normalized) queries.

scripting.scripting

  • Name: the name of the connection.
  • Language: the scripting language used; only beanshell is currently supported.
  • InitScript: the initialization script to be run before the query.

kettle.TransFromFile

  • Name: the name of the connection.
  • Kettle Transformation File: the kettle transformation file to use.
  • Variable: he value of a variable to be used in the transformation; multiple variable properties may be defined, one for each variable to use

xpath.xpath

  • Name: the name of the connection.
  • DataFile: a XML file with the data to query.

Examples

You can find examples of all these types of connection in your solution repository, under bi-developers/cda/cdafiles .

Simple Data Accesses

A simple data access is a query over a connection. It's defined by a DataAccess block. CDA supports the following types of data accesses.

Inside a DataAccess block the query is defined inside a Query block, containing the query code.The following attributes must be defined:

  • Access Level: public or private; public data accesses are available from outside calls, while private data accesses can only be called from other data accesses (e.g. compound data accesses);
  • Parameters: lists the parameter's name, default value (the default value if the parameter value isn't specified when the data access is called) and type (String, Integer, Numeric, Date, StringArray, IntegerArray, NumericArray and DateArray);
  • Output: which columns to output and by which order;
  • Output Mode: the column's output mode, include or exclude the columns set above;
  • Columns: names of the columns;
  • Calculated Columns: the columns to be calculated by a given formula;
  • Query: the query itself;
  • Cache: the possibility to cache the results, either true or false (optional, defaults to true);
  • Cache Duration: time to keep the results in cache, in seconds (optional, defaults to 3600).

Reference information about the different query types

Banded Mode

When setting a Mondrian connection there is the possibility to choose between compact or classic Banded Mode.

This option can make a lot of difference in the output of the results. The following query will show the Sales from a given Market at a given Time :

select NON EMPTY {[Measures].[Sales]} ON COLUMNS,
   NON EMPTY ({[Markets].[EMEA].Children}) ON ROWS
from [SteelWheelsSales]
WHERE [Time].[2005]

In compact mode this query will output one column for Sales and one column for the selected Market , as shown below:

Markets Sales
Austria 68250.26000000001
Belgium 25040.11
Denmark 26012.870000000003
Finland 126851.70999999996
France 242956.4
Italy 41509.94
Spain 326798.16999999987
Sweden 31606.72
UK 40802.810000000005

However, if you choose the classic mode, the same query returns the full mondrian hierarchy:

[Markets].[(All)] [Markets].[Territory] [Markets].[Country] [Measures].[Sales]
All Markets EMEA Austria 68250.26000000001
All Markets EMEA Belgium 25040.11
All Markets EMEA Denmark 26012.870000000003
All Markets EMEA Finland 126851.70999999996
All Markets EMEA France 242956.4
All Markets EMEA Italy 41509.94
All Markets EMEA Spain 326798.16999999987
All Markets EMEA Sweden 31606.72
All Markets EMEA UK 40802.810000000005

Normalized vs. Denormalized MDX

CDA allows two different types of MDX queries: standard (normalized) or denormalized.It's not within the scope of this document to elaborate over the details of data normalization/denormalization, but the following example can shed some light over the differences between the two:

select
{[Measures].[Sales], [Measures].[Quantity]} ON COLUMNS,
NON EMPTY [Time].Children ON ROWS
from [SteelWheelsSales]
where ([Order Status].[${status}])

This query will generate one column for each measure in normalized form and have only one row for each child of the Time dimension. In denormalized form there will be only two columns added for measures, one indicating the name of the measure and another for its value, regardless of the number of measures queried.

The resulting dataset will have one row for each combination of measure and child of the Time dimension.

In a very loose way one can say that normalized minimizes row redundancy at the expense of the number of columns whereas denormalized form limits the number of columns at the expense of row redundancy.

So, in normalized form, the result of the above query is:

[Time].[(All)] [Time].[Years] [Measures].[Sales] [Measures].[Quantity]
All Years 2004 4750205.889999998 47151
All Years 2005 1513074.4600000002 14607
All Years 2003 3573701.2500000023 35313

In denormalized form, the same query returns:

[Time].[(All)] [Time].[Years] [Measures].[MeasuresLevel] Measure
All Years 2003 Quantity 35313
All Years 2004 Quantity 47151
All Years 2005 Quantity 14607
All Years 2005 Sales 1513074.4600000002
All Years 2004 Sales 4750205.889999998
All Years 2003 Sales 3573701.2500000023

Parameters

Parameters are variables that can be passed on to the query, for example to create dynamic queries depending on user input.The query uses two parameters, status and orderDate. Parameters are always passed to the query as ${parameterName}.

select o.YEAR_ID, o.STATUS, sum(o.TOTALPRICE) as price from orderfact o
where o.STATUS = ${status} and o.ORDERDATE > ${orderDate}
group by o.YEAR_ID, o.STATUS

The definition of parameters is done inside a Parameters block, where the various parameters are defined. For each parameter the type and default attributes must be specified. A third attribute, pattern must be specified for parameters of the date type.

  • type : String, Integer, Numeric (double), Date, StringArray, IntegerArray, NumericArray and DateArray;
  • default : the default value if the parameter value isn't specified when the data access is called;
  • pattern : date format. All formats accepted by the SimpleDateFormat Java class are valid.

Check the example below:

 <Parameters>
<Parameter name="status" type="String" default="Shipped"/>
<Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>
</Parameters>

Columns

The Columns block supports renaming of columns and defining of new columns, using formulas. Each column may be renamed using a Column block, by specifying its name:

<Column idx="1">   
	<Name>Year</Name>   
</Column>

Calculated columns are added by a CalculatedColumn block, where the new column's name and the formula to use are defined.

Each calculated column requires two properties: Name (the name that will be output by CDA), and Formula (the column's definition itself).

Formulas are written in Open Formula format.

<CalculatedColumn>    
	<Name>PriceInM</Name>                           
	<Formula>=[price]/1000000</Formula>   
</CalculatedColumn>

Output

The Output tag define which columns to output and by which order. If not specified, all columns from the query will be returned in the same order as defined on the query.

Output requires only one attribute, the indexes to be returned. The order of the output columns is defined by the order in which the indexes are specified.

<Output indexes="1,0,3"/>
  • sql;
  • mql;
  • xPath;
  • mdx;
  • olap4j;
  • kettle;
  • denormalizedMdx;
  • scriptable.

Compound Data Accesses

Compound data accesses combine the results of two queries. They're defined by a CompoundDataAccess block.

There are two types of compound data accesses supported: union and join .

Union

Unions take the results of two queries with the same number of columns and return the rows from both data accesses on top of each other.

A union compound data access requires only these attributes:

  • Name: the name of the compound query;
  • Top: the id of the query that will stay on top;
  • Bottom: the id of the query that will stay on the bottom;
  • Parameters: lists the parameter's name, default value (the default value if the parameter value isn't specified when the data access is called) and type (String, Integer, Numeric, Date, StringArray, IntegerArray, NumericArray and DateArray) that are passed on to the compound query.

If the columns from both queries have different names, it will prevail the names of the query that is on top. Take a look at the example below.

<CompoundDataAccess id="3" type="union">
   <Top id="2"/>
   <Bottom id="1"/>
   <Parameters>
      <Parameter name="year" type="Numeric" default="2004"/>
   </Parameters>
</CompoundDataAccess>

Join

Joins take the results of two queries and join them by a specified set of keys. The result has the columns of both queries. The join performed is full outer.

Join compound data accesses require that both the Left and Right data accesses be specified (by id) as well as the keys (column ids on the source queries) by which to join.

A join compound data access requires the following attributes:

  • Name: the name of the compound query;
  • Left: the first query;
  • Left Keys: the id or ids of the columns from the first query that are common to the second query;
  • Right: the second query;
  • Right Keys: the id or ids of the columns from the second query that are common to the first query;
  • Parameters: lists the parameter's name, default value (the default value if the parameter value isn't specified when the data access is called) and type (String, Integer, Numeric, Date, StringArray, IntegerArray, NumericArray and DateArray) that are passed on to the compound query;
  • Output Options: the ids of the columns to output from both queries by order, starting with the columns from the left query and them the ones from the right query;
  • Output Mode: the column's output mode, include or exclude the columns set above The columns that are common in both queries must be of the same type. Take a look at the example below.

The columns that are common in both queries must be of the same type. Take a look at the example below.

<CompoundDataAccess id="3" type="join">
    <Left id="1" keys="0,1"/>
    <Right id="2" keys="0,1"/>
    <Parameters>
       <Parameter name="status" type="String" default="Shipped"/>
       <Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>
    </Parameters>
    <Output indexes="0,1,2,5,6"/>
</CompoundDataAccess>

CDA is called by HTTP requests to BASE_URL/pentaho/content/cda/ .

Methods Supported

  • doQuery;
  • listQueries;
  • getCdaList;
  • listParameters;
  • clearCache;
  • listDataAccessTypes.

CDA File

Some of the methods listed above require the CDA file to be specified in the URL.This can be done in two different ways:

  • providing the solution folder, the path to the file from the solution and the name of the CDA file.
  • providing the full path from the solution repository.

Check the example below.

solution=bi-developers&path=cda/cdafiles&file=compoundUnion.cda 

is the same as

path=bi-developers/cda/cdafiles/compoundUnion.cda

Output

All methods listed above (except doQuery and clearCache) output the results as a JSON formatted object, with two key values: resultset with the data itself and metadata with info about column indexes, names and data types.

Check the example of a listQueries output:

{"resultset":[
	["3","","union"],
	["2","Sql Query on SampleData","sql"]],
"metadata":[
	{"colIndex":0,"colType":"String","colName":"id"},
	{"colIndex":1,"colType":"String","colName":"name"},
	{"colIndex":2,"colType":"String","colName":"type"}
]}

clearCache returned the string "Cache cleared".

doQuery returns the query results in one of five formats: JSON, XML, HTML, XLS or CSV. The default is JSON.

Method Reference

doQuery

doQuery establishes the necessary connections and performs the query specified by the dataAccessId parameter.

You may specify the values of the different parameters used by the query (if not specified default values defined on the CDA file will be used). Parameters are passed to the URL as paramParameter, where Parameter is the parameter's name.

Check this example: if the data access with id=1 of the file $SOLUTION_REPOSITORY/solution/cda/file.cda has a parameter called Status and you want the query to use Shipped as the parameter the URL to use is:

$BASE_URL/pentaho/content/cda/doQuery?path=solution/cda/file.cda&dataAccessId=1&paramStatus=Shipped 

doQuery has an optional argument, outputType , where the format of the returned data is specified as either json , xml , csv , xls or html . The default value is json .

listQueries

This method returns a JSON formatted list of all public queries in the specified CDA file.

getCdaList

Returns a JSON formatted list of all CDA files in the solution repository. Doesn't take any arguments.

listDataAccessTypes

Returns a JSON formatted list with all supported data access types and connections and their definitions.

listParameters

Returns a list of parameters used by a data access. Requires both the path and dataAccessId to be specified.

The CDA previewer is an application that allows CDA queries to be returned as HTML and displayed on table form.

Parameters can be changed using input boxes and a dropdown menu allows changing the dataAccessId. The previewer is called from the URL.

Check the example below.

$BASE_URL/pentaho/content/cda/previewQuery?solution='path'&path=&file='filename'.cda 

Specifying the path argument as the directory inside the solution directory where the CDA file is, specified in filename by the name of the CDA file.

An embedded editor allows editing CDA files under the scope of the Pentaho server, avoiding the need to use external editors and manually refreshing the solution repository.

This only works on some browsers, like Firefox or Chrome. The editor is called from the URL. Check the example below.

$BASE_URL/pentaho/content/cda/editFile?solution='path'&path=&file='filename'.cda 

Specifying the path argument as the directory inside the solution directory where the CDA file is, specified in filename by the name of the CDA file.

The editor does not allow creating new CDA files. You must manually create a blank CDA file on the solution repository and refresh the solution repository to be able to edit the file.

See the solution repository of your Pentaho installation for sample CDA files.

Wondering who can provide you a fitting dashboard design? Consider Webdetails.

FAQ

What does CDA mean?

CDA stands for Community Data Access

Is CDA free?

CDA is licensed under the MPLv2 license.

I just want to develop dashboards with CDE. Do I have to learn all about CDA?

For developing dashboards, you don't need to know the CDA API. However it's important that you know about Connections and Data access, as the CDE data sources rely on these concepts.

Check our demo. We can build a matching solution for your scenario too. Ask us for a quote.

Check our demo. We can build a matching solution for your scenario too.

Recommend this article:

Registered in Portugal. NIF 508 528 283