Version 1.0, 2010/04/30
Table of Contents
CDA (Community Data Access) is a plugin for Pentaho Business Analysis suite, designed to allow greater flexibility for data sources. Most tools available can perform queries over a single data source and there's no easy way to join data coming from different databases or in different languages (joining data from a SQL query with data from a MDX query). These tasks usually require an independent ETL job to synchronize different data sources and build a new database.
CDA was developed as an abstraction tool between database connections and CDF (Community Dashboard Framework). It allows data to be retrieved from multiple datasources and combined in a single output that can easily be passed on to dashboard components.
It serves three main purposes:
CDA uses two different components: A connection, the database or pentaho datasource to use, and a dataAccess, a query over that connection.
Connections and queries are defined in a XML file (the CDA file) lying in the solution repository. Data access is done by a call to a specific dataAccess id in the CDA file.
Each data access may have parameters and calculated columns added to the results. It's also possible to have compoundDataAccess elements, which represent joins or unions on different queries.
CDA features an API to fetch query results, an editor to edit CDA files and a previewer to visualize the results of a query in table form.
Finally, export: query results can be returned from the API in various formats. Currently supported are JSON, XML, CSV and XLS.
A CDA file is a XML file defining the database connections and the queries over them. All it's content must be enclosed in a CDADescriptor block.
<?xml version="1.0" encoding="utf-8"?>
<CDADescriptor>
(...)
</CDADescriptor>
Inside the main block we define datasources (the database connections), data accesses (the queries themselves) and compound data accesses (joins and unions of queries).
Connections are defined inside a DataSources block. Inside it each connection is defined by a Connection block which contains all necessary information. For each connection an (unique) id and type must be provided.
CDA supports the following connection types:
olap4j, scripting, xpath and kettle connections are available only on CDA for Pentaho 3.6.
A connection defines the datasource and the way the connection is established. Each connection type requires a different set of properties to be defined.
Simple - Driver: the Java classname of the driver (e.g. com.mysql.jdbc.Driver) - Host: the URL to connect to (e.g. jdbc:mysql://localhost:3306/database) - User: the username to use - Pass: the user's password
You can find examples of all these types of connection in your solution repository, under bi-developers/cda/cdafiles.
A simple data access is a query over a connection. It's defined by a DataAccess block. CDA supports the following types of data accesses:
The following attributes must be defined:
Inside a DataAccess block the query is defined inside a Query block, containing the query code.
The data access may be parametrized using the following blocks:
For information about the different query languages see the following reference sites:
CDA allows two different types of MDX queries: standard (normalized) or denormalized. It's not within the scope of this document to elaborate over the details of data normalization/denormalization, but the following example can shed some light over the differences between the two:
select
{[Measures].[Sales], [Measures].[Quantity]} ON COLUMNS,
NON EMPTY [Time].Children ON ROWS
from [SteelWheelsSales]
where ([Order Status].[${status}])
This query will generate one column for each measure in normalized form and have only one row for each child of the Time dimension. In denormalized form there will only two columns added for measures, one indicating the name of the measure and another for its value, regardless of the number of measures queried. The resulting dataset will have one row for each combination of measure and child of the Time dimension.
In a very loose way one can say that normalized minimizes row redundancy at the expense of the number of columns whereas denormalized form limits the number of columns at the expense of row redundancy.
So, in normalized form, the result of the above query is:
| [Time].[(All)] | [Time].[Years] | [Measures].[Sales] | [Measures].[Quantity] |
|---|---|---|---|
| All Years | 2004 | 4750205.889999998 | 47151 |
| All Years | 2005 | 1513074.4600000002 | 14607 |
| All Years | 2003 | 3573701.2500000023 | 35313 |
In denormalized form, the same query returns:
| [Time].[(All)] | [Time].[Years] | [Measures].[MeasuresLevel] | Measure |
|---|---|---|---|
| All Years | 2003 | Quantity | 35313 |
| All Years | 2004 | Quantity | 47151 |
| All Years | 2005 | Quantity | 14607 |
| All Years | 2005 | Sales | 1513074.4600000002 |
| All Years | 2004 | Sales | 4750205.889999998 |
| All Years | 2003 | Sales | 3573701.2500000023 |
Parameters are variables that can be passed on to the query, for example to create dynamic queries depending on user input.
The query
select o.YEAR_ID, o.STATUS, sum(o.TOTALPRICE) as price from orderfact o
where o.STATUS = ${status} and o.ORDERDATE > ${orderDate}
group by o.YEAR_ID, o.STATUS
Uses two parameters, status and orderDate. Parameters are always passed to the query as ${parameterName}.
The definition of parameters is done inside a `Parameters block, where the various parameters are defined. For each parameter the type and default attributes must be specified. A third attribute, pattern must be specified for parameters of the date type.
Example:
<Parameters>
<Parameter name="status" type="String" default="Shipped"/>
<Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>
</Parameters>
The Columns block supports renaming of columns and defining of new columns, using formulas.
Each column may be renamed using a Column block, by specifying its name:
<Column idx="1">
<Name>Year</Name>
</Column>
Calculated columns are added by a CalculatedColumn block, where the new column's name and the formula to use are defined. Each calculated column requires two properties: Name (the name that will be output by CDA), and Formula (the column's definition itself).
<CalculatedColumn>
<Name>PriceInM</Name>
<Formula>=[price]/1000000</Formula>
</CalculatedColumn>
Formulas are written in Open Formula format.
The Output tag define which columns to output and by which order. If not specified all columns from the query will be output in the same order as defined on the query. Output requires only one attribute, the indexes to be output. The order of the output columns is defined by the order in which the indexes are specified.
<Output indexes="1,0,3"/>
Compound data accesses combine the results of two queries. They're defined by a CompoundDataAccess block. There are two types of compound data accesses supported: union and join.
Unions take the results of two queries with the same number of columns and returning the rows from both data accesses on top of each other.
Joins take the results of two queries and join them by a specified set of keys. The result has the columns of both queries. The join performed is full outer.
A compound data access requires only two attributes:
If the two source data accesses have parameters these must also be defined in the compound data access, inside a Parameters block. Compound data accesses may also have Name, Columns and Output blocks.
Join compound data accesses require that both the Left and Right data accesses be specified (by id) as well as the keys (column ids on the source queries) by which to join.
<CompoundDataAccess id="3" type="join">
<Left id="1" keys="0,1"/>
<Right id="2" keys="0,1"/>
<Parameters>
<Parameter name="status" type="String" default="Shipped"/>
<Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>
</Parameters>
<Output indexes="0,1,2,5,6"/>
</CompoundDataAccess>
Union compound data accesses require the definition of Top and Bottom queries (by id).
<CompoundDataAccess id="3" type="union">
<Top id="2"/>
<Bottom id="1"/>
<Parameters>
<Parameter name="year" type="Numeric" default="2004"/>
</Parameters>
</CompoundDataAccess>
CDA is called by http requests to BASE_URL/pentaho/content/cda/.
The following methods are currently supported:
Some of the methods listed above require that the CDA file be specified on the URL. This can be done in two different ways:
Example:
solution=bi-developers&path=cda/cdafiles&file=compoundUnion.cda
is the same as
path=bi-developers/cda/cdafiles/compoundUnion.cda
All methods listed above (except doQuery and clearCache) output the results as a JSON formatted object, with two key values: resultset with the data itself and metadata with info about column indexes, names and data types.
Example of a listQueries output:
{"resultset":[
["3","","union"],
["2","Sql Query on SampleData","sql"]],
"metadata":[
{"colIndex":0,"colType":"String","colName":"id"},
{"colIndex":1,"colType":"String","colName":"name"},
{"colIndex":2,"colType":"String","colName":"type"}]}
clearCache returned the string "Cache cleared".
doQuery returns the query results in one of four formats: JSON, XML, XLS or CSV. The default is JSON.
doQuery establishes the necessary connections and performes the query specified by the dataAccessId parameter.
You may specify the values of the different parameters used by the query (if not specified default values defined on the CDA file will be used). Parameters are passed on the URL as paramParameter, where Parameter is the parameter's name.
Example:: if the data access with id=1 of the file $SOLUTION_REPOSITORY/solution/cda/file.cda has a parameter called Status and you want the query to use Shipped as the parameter the URL to use is
$BASE_URL/pentaho/content/cda/doQuery?path=solution/cda/file.cda&dataAccessId=1¶mStatus=Shipped
doQuery has an optional argument, outputType where the format of the returned data is specified as either `json, xml, csv or xls. The default value is json.
This method returns a JSON formatted list of all public queries in the specified CDA file.
Returns a JSON formatted list of all CDA files in the solution repository. Doesn't take any arguments.
Returns a list of parameters used by a data access. Requires both the path and dataAccessId to be specified.
Clears the contents of the cache. Takes no arguments.
Returns a JSON formatted list with all supported data access types and connections and their definitions.
The CDA previewer is an application that allows CDA queries to be returned as HTML and displayed on table form. Parameters can be changed using input boxes and a dropdown menu allows changing the dataAccessId.
The previewer is called from the URL
Example:
$BASE_URL/pentaho/content/cda/previewQuery?path=solution/cda/file.cda
specifying the filename in the path argument.
An embedded editor allows editing CDA files under the scope of Pentaho server, avoiding the need to use external editors and manually refreshing the solution repository. This only works on some browsers, like firefox or chrome.
The editor is called from the URL
Example:
$BASE_URL/pentaho/content/cda/editFile?path=solution/cda/file.cda
specifying the filename in the path argument. The editor does not allow creating new CDA files. You must manually create a blank CDA file on the solution repository and refresh the solution repository to be able to edit the file.
See the solution repository of your Pentaho installation for sample CDA files.
For more information about CDA, contact us at imagine@webdetails.pt or visit our website