Community Data Access (CDA) was designed to allow greater flexibility for datasources.
Most tools available can perform queries over a single datasource and there's no easy way to join data coming from different databases or in different languages (joining data from a SQL query with data from a MDX query). These tasks usually require an independent ETL job to synchronize different data sources and build a new database. CDA was developed as an abstraction tool between database connections and CDF (Community Dashboard Framework).
It allows data to be retrieved from multiple data sources and combined in a single output that can easily be passed on to dashboard components. It serves three main purposes:
- To allow joining data from different sources;
- To avoid SQL injection problems within CDF;
- To allow an easier way to export data from queries.
Wondering who can provide you a fitting dashboard design? Consider Webdetails.
CDA uses two different components: the
connection
(the database or Pentaho datasource to use) and the
dataAccess
(a query over that connection).
Connections and queries are defined in a XML file (the CDA file) lying in the solution repository. Data access is done by a call to a specific dataAccess id in the CDA file.
Each data access may have parameters and calculated columns added to the results. It's also possible to have compoundDataAccess elements, which represent joins or unions on different queries.
CDA features an API to fetch query results, an editor to edit CDA files and a previewer to visualize the results of a query in table form.
Finally, export: query results can be returned from the API in various formats. Currently supported formats are JSON, XML, CSV, XLS and HTML visual mode.
A CDA file is a XML file defining the database connections and the queries over them.
All its content must be enclosed in a
CDADescriptor
block.
<?xml version="1.0" encoding="utf-8"?>
<CDADescriptor>
(...)
</CDADescriptor>
Inside the main block we define datasources (the database connections), data accesses (the queries themselves) and compound data accesses (joins and unions of queries).
The CDA file can now also be written from inside the CDE just by defining all the
datasources
,
data accesses
and
compound data accesses
, which is simpler than writing all the code.
Connections are defined inside a
DataSources
block. Each connection is defined by a
Connection
block which contains all the necessary information. For each connection an (unique)
id
and
type
must be provided.
Supported Connection Types
-
metadata.metadata;
-
sql.jdbc;
-
sql.jndi;
-
mondrian.jdbc;
-
mondrian.jndi;
-
olap4j.jdbc;
-
olap4j.jndi;
-
scripting.scripting;
-
kettle.TransFromFile;
-
xpath.xPath.
olap4j
,
scripting
,
xpath
and
kettle
connections are available only on CDA for Pentaho 3.6 onwards.
A connection defines the datasource and the way the connection is established.
Each connection type requires a different set of properties to be defined.
metadata.metadata
-
Name
: the name of the connection.
-
XmiFile
: the file containing the metadata definitions, exported from the Metadata repository.
-
DomainId
: the Metadata domain as defined on the Pentaho metadata layer.
sql.jdbc
-
Name
: the name of the connection.
-
Driver
: the Java classname of the driver (e.g.
com.mysql.jdbc.Driver
).
-
URL
: the URL to connect to (e.g.
jdbc:mysql://localhost:3306/database
).
-
User Name
: the username to use
-
Password
: the user's password
sql.jndi
-
Name
: the name of the connection.
-
Jndi
: the connection's parameters as defined in the context.xml file.
mondrian.jdbc
-
Name
: the name of the connection.
-
Mondrian schema
: mondrian schema file.
-
Driver
: the Java classname of the driver (e.g.
com.mysql.jdbc.Driver
).
-
URL
: the URL to connect to (e.g.
jdbc:mysql://localhost:3306/database
).
-
User Name
: the username to use.
-
Password
: the user's password.
-
Banded Mode
: compact or classic.
mondrian.jndi
-
Name
: the name of the connection.
-
Mondrian schema
: mondrian schema file
-
Jndi
: the connection's parameters as defined in the context.xml file.
-
Banded Mode
: compact or classic. There is also the possibility of a denormalized
mondrian.jdbc
and a denormalized
mondrian.jndi
MDX queries that have the same options as the standard (normalized) queries. The differences between denormalized and standard (normalized) queries are explained on data accesses. The differences between compact and classic on Banded Mode are also explained on data accesses.
olap4j.jdbc
-
Name
: the name of the connection.
-
Driver
: the olap4j driver to use (typically
mondrian.olap4j.MondrianOlap4jDriver
).
-
Url
: the connection url, typically
jdbc:mondrian:
.
-
User name for the underlying database
: the username to use.
-
Password for the underlying database
: the user's password.
-
Driver for the underlying database
: the Java classname of driver (e.g. org.hsqldb.jdbcDriver).
-
URL pointing at the underlying database
: the database url (e.g.
jdbc:hsqldb:hsql://localhost:9001/sampledata
).
-
Catalog to apply to the underlying database
: path to the the Mondrian schema file (e.g.
solution:/steel-wheels/analysis/steelwheels.mondrian.xml
).
olap4j.jndi
-
Name
: the name of the connection.
-
Jndi
: the connection's parameters as defined in the context.xml file
There is also the possibility of a denormalized
olap4j.jdbc
and a denormalized
olap4j.jndi
MDX queries that have the same options as the standard (normalized) queries.
scripting.scripting
-
Name
: the name of the connection.
-
Language
: the scripting language used; only beanshell is currently supported.
-
InitScript
: the initialization script to be run before the query.
kettle.TransFromFile
-
Name
: the name of the connection.
-
Kettle Transformation File
: the kettle transformation file to use.
-
Variable
: he value of a variable to be used in the transformation; multiple variable properties may be defined, one for each variable to use
xpath.xpath
-
Name
: the name of the connection.
-
DataFile
: a XML file with the data to query.
Examples
You can find examples of all these types of connection in your solution repository, under
bi-developers/cda/cdafiles
.
Simple Data Accesses
A simple data access is a query over a connection. It's defined by a DataAccess block. CDA supports the following types of data accesses.
Inside a
DataAccess
block the query is defined inside a
Query
block, containing the query code.The following attributes must be defined:
-
Access Level
: public or private; public data accesses are available from outside calls, while private data accesses can only be called from other data accesses (e.g. compound data accesses);
-
Parameters
: lists the parameter's name, default value (the default value if the parameter value isn't specified when the data access is called) and type (String, Integer, Numeric, Date, StringArray, IntegerArray, NumericArray and DateArray);
-
Output
: which columns to output and by which order;
-
Output Mode
: the column's output mode, include or exclude the columns set above;
-
Columns
: names of the columns;
-
Calculated Columns
: the columns to be calculated by a given formula;
-
Query
: the query itself;
-
Cache
: the possibility to cache the results, either true or false (optional, defaults to true);
-
Cache Duration
: time to keep the results in cache, in seconds (optional, defaults to 3600).
Reference information about the different query types
Banded Mode
When setting a Mondrian connection there is the possibility to choose between
compact
or
classic
Banded Mode.
This option can make a lot of difference in the output of the results. The following query will show the
Sales
from a given
Market
at a given
Time
:
select NON EMPTY {[Measures].[Sales]} ON COLUMNS,
NON EMPTY ({[Markets].[EMEA].Children}) ON ROWS
from [SteelWheelsSales]
WHERE [Time].[2005]
In
compact
mode this query will output one column for
Sales
and one column for the selected
Market
, as shown below:
| Markets |
Sales |
| Austria |
68250.26000000001 |
| Belgium |
25040.11 |
| Denmark |
26012.870000000003 |
| Finland |
126851.70999999996 |
| France |
242956.4 |
| Italy |
41509.94 |
| Spain |
326798.16999999987 |
| Sweden |
31606.72 |
| UK |
40802.810000000005 |
However, if you choose the classic mode, the same query returns the full mondrian hierarchy:
| [Markets].[(All)] |
[Markets].[Territory] |
[Markets].[Country] |
[Measures].[Sales] |
| All Markets |
EMEA |
Austria |
68250.26000000001 |
| All Markets |
EMEA |
Belgium |
25040.11 |
| All Markets |
EMEA |
Denmark |
26012.870000000003 |
| All Markets |
EMEA |
Finland |
126851.70999999996 |
| All Markets |
EMEA |
France |
242956.4 |
| All Markets |
EMEA |
Italy |
41509.94 |
| All Markets |
EMEA |
Spain |
326798.16999999987 |
| All Markets |
EMEA |
Sweden |
31606.72 |
| All Markets |
EMEA |
UK |
40802.810000000005 |
Normalized vs. Denormalized MDX
CDA allows two different types of MDX queries: standard (normalized) or denormalized.It's not within the scope of this document to elaborate over the details of data normalization/denormalization, but the following example can shed some light over the differences between the two:
select
{[Measures].[Sales], [Measures].[Quantity]} ON COLUMNS,
NON EMPTY [Time].Children ON ROWS
from [SteelWheelsSales]
where ([Order Status].[${status}])
This query will generate one column for each measure in normalized form and have only one row for each child of the Time dimension. In denormalized form there will be only two columns added for measures, one indicating the name of the measure and another for its value, regardless of the number of measures queried.
The resulting dataset will have one row for each combination of measure and child of the Time dimension.
In a very loose way one can say that normalized minimizes row redundancy at the expense of the number of columns whereas denormalized form limits the number of columns at the expense of row redundancy.
So, in normalized form, the result of the above query is:
| [Time].[(All)] |
[Time].[Years] |
[Measures].[Sales] |
[Measures].[Quantity] |
| All Years |
2004 |
4750205.889999998 |
47151 |
| All Years |
2005 |
1513074.4600000002 |
14607 |
| All Years |
2003 |
3573701.2500000023 |
35313 |
In denormalized form, the same query returns:
| [Time].[(All)] |
[Time].[Years] |
[Measures].[MeasuresLevel] |
Measure |
| All Years |
2003 |
Quantity |
35313 |
| All Years |
2004 |
Quantity |
47151 |
| All Years |
2005 |
Quantity |
14607 |
| All Years |
2005 |
Sales |
1513074.4600000002 |
| All Years |
2004 |
Sales |
4750205.889999998 |
| All Years |
2003 |
Sales |
3573701.2500000023 |
Parameters
Parameters are variables that can be passed on to the query, for example to create dynamic queries depending on user input.The query uses two parameters, status and orderDate. Parameters are always passed to the query as
${parameterName}
.
select o.YEAR_ID, o.STATUS, sum(o.TOTALPRICE) as price from orderfact o
where o.STATUS = ${status} and o.ORDERDATE > ${orderDate}
group by o.YEAR_ID, o.STATUS
The definition of parameters is done inside a
Parameters
block, where the various parameters are defined. For each parameter the
type
and
default
attributes must be specified. A third attribute,
pattern
must be specified for parameters of the date
type
.
-
type
: String, Integer, Numeric (double), Date, StringArray, IntegerArray, NumericArray and DateArray;
-
default
: the default value if the parameter value isn't specified when the data access is called;
-
pattern
: date format. All formats accepted by the SimpleDateFormat Java class are valid.
Check the example below:
<Parameters>
<Parameter name="status" type="String" default="Shipped"/>
<Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>
</Parameters>
Columns
The Columns block supports renaming of columns and defining of new columns, using formulas. Each column may be renamed using a Column block, by specifying its name:
<Column idx="1">
<Name>Year</Name>
</Column>
Calculated columns are added by a
CalculatedColumn
block, where the new column's name and the formula to use are defined.
Each calculated column requires two properties:
Name
(the name that will be output by CDA), and
Formula
(the column's definition itself).
Formulas are written in Open Formula format.
<CalculatedColumn>
<Name>PriceInM</Name>
<Formula>=[price]/1000000</Formula>
</CalculatedColumn>
Output
The Output tag define which columns to output and by which order. If not specified, all columns from the query will be returned in the same order as defined on the query.
Output requires only one attribute, the indexes to be returned. The order of the output columns is defined by the order in which the indexes are specified.
<Output indexes="1,0,3"/>
- sql;
- mql;
- xPath;
- mdx;
- olap4j;
- kettle;
- denormalizedMdx;
- scriptable.
Compound Data Accesses
Compound data accesses combine the results of two queries. They're defined by a
CompoundDataAccess
block.
There are two types of compound data accesses supported:
union
and
join
.
Union
Unions take the results of two queries with the same number of columns and return the rows from both data accesses on top of each other.
A union compound data access requires only these attributes:
-
Name
: the name of the compound query;
-
Top
: the id of the query that will stay on top;
-
Bottom
:the id of the query that will stay on the bottom;
-
Parameters
: lists the parameter's name, default value (the default value if the parameter value isn't specified when the data access is called) and type (String, Integer, Numeric, Date, StringArray, IntegerArray, NumericArray and DateArray) that are passed on to the compound query.
If the columns from both queries have different names, it will prevail the names of the query that is on top. Take a look at the example below.
<CompoundDataAccess id="3" type="union">
<Top id="2"/>
<Bottom id="1"/>
<Parameters>
<Parameter name="year" type="Numeric" default="2004"/>
</Parameters>
</CompoundDataAccess>
Join
Joins take the results of two queries and join them by a specified set of keys. The result has the columns of both queries. The join performed is full outer.
Join compound data accesses require that both the
Left
and
Right
data accesses be specified (by id) as well as the
keys
(column ids on the source queries) by which to join.
A join compound data access requires the following attributes:
-
Name
: the name of the compound query;
-
Left
: the first query;
-
Left Keys
: the id or ids of the columns from the first query that are common to the second query;
-
Right
: the second query;
-
Right Keys
: the id or ids of the columns from the second query that are common to the first query;
-
Parameters
: lists the parameter's name, default value (the default value if the parameter value isn't specified when the data access is called) and type (String, Integer, Numeric, Date, StringArray, IntegerArray, NumericArray and DateArray) that are passed on to the compound query;
-
Output Options
: the ids of the columns to output from both queries by order, starting with the columns from the left query and them the ones from the right query;
-
Output Mode
: the column's output mode, include or exclude the columns set above The columns that are common in both queries must be of the same type. Take a look at the example below.
The columns that are common in both queries must be of the same type. Take a look at the example below.
<CompoundDataAccess id="3" type="join">
<Left id="1" keys="0,1"/>
<Right id="2" keys="0,1"/>
<Parameters>
<Parameter name="status" type="String" default="Shipped"/>
<Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>
</Parameters>
<Output indexes="0,1,2,5,6"/>
</CompoundDataAccess>
CDA is called by HTTP requests to
BASE_URL/pentaho/content/cda/
.
Methods Supported
- doQuery;
- listQueries;
- getCdaList;
- listParameters;
- clearCache;
- listDataAccessTypes.
CDA File
Some of the methods listed above require the CDA file to be specified in the URL.This can be done in two different ways:
- providing the
solution
folder, the
path
to the file from the solution and the name of the CDA file.
- providing the full
path
from the solution repository.
Check the example below.
solution=bi-developers&path=cda/cdafiles&file=compoundUnion.cda
is the same as
path=bi-developers/cda/cdafiles/compoundUnion.cda
Output
All methods listed above (except doQuery and clearCache) output the results as a JSON formatted object, with two key values: resultset with the data itself and metadata with info about column indexes, names and data types.
Check the example of a listQueries output:
{"resultset":[
["3","","union"],
["2","Sql Query on SampleData","sql"]],
"metadata":[
{"colIndex":0,"colType":"String","colName":"id"},
{"colIndex":1,"colType":"String","colName":"name"},
{"colIndex":2,"colType":"String","colName":"type"}
]}
clearCache
returned the string "Cache cleared".
doQuery
returns the query results in one of four formats: JSON, XML, XLS or CSV. The default is JSON.
Method Reference
doQuery
doQuery
establishes the necessary connections and performs the query specified by the dataAccessId parameter.
You may specify the values of the different parameters used by the query (if not specified default values defined on the CDA file will be used). Parameters are passed to the URL as paramParameter, where Parameter is the parameter's name.
Check this example: if the data access with id=1 of the file
$SOLUTION_REPOSITORY/solution/cda/file.cda
has a parameter called Status and you want the query to use Shipped as the parameter the URL to use is:
$BASE_URL/pentaho/content/cda/doQuery?path=solution/cda/file.cda&dataAccessId=1¶mStatus=Shipped
doQuery
has an optional argument,
outputType
, where the format of the returned data is specified as either
json
,
xml
,
csv
,
xls
or
html
. The default value is
json
.
listQueries
This method returns a JSON formatted list of all public queries in the specified CDA file.
getCdaList
Returns a JSON formatted list of all CDA files in the solution repository. Doesn't take any arguments.
listDataAccessTypes
Returns a JSON formatted list with all supported data access types and connections and their definitions.
listParameters
Returns a list of parameters used by a data access. Requires both the path and dataAccessId to be specified.
The CDA previewer is an application that allows CDA queries to be returned as HTML and displayed on table form.
Parameters can be changed using input boxes and a dropdown menu allows changing the dataAccessId. The previewer is called from the URL.
Check the example below.
$BASE_URL/pentaho/content/cda/previewQuery?solution='path'&path=&file='filename'.cda
Specifying the path argument as the directory inside the solution directory where the CDA file is, specified in filename by the name of the CDA file.
An embedded editor allows editing CDA files under the scope of the Pentaho server, avoiding the need to use external editors and manually refreshing the solution repository.
This only works on some browsers, like Firefox or Chrome. The editor is called from the URL. Check the example below.
$BASE_URL/pentaho/content/cda/editFile?solution='path'&path=&file='filename'.cda
Specifying the path argument as the directory inside the solution directory where the CDA file is, specified in
filename
by the name of the CDA file.
The editor does not allow creating new CDA files. You must manually create a blank CDA file on the solution repository and refresh the solution repository to be able to edit the file.
See the solution repository of your Pentaho installation for sample CDA files.