In previous posts, we’ve delved into the principles of multidimensional databases. Among all the benefits that a multidimensional database delivers is complex aggregation, a process by which KPIs are written once and are immediately available across any dimensions, through any filtering, letting the user follow his train of thought.
But how does complex aggregation actually work? This post explores a concrete use case, articulates the technology challenges behind complex aggregation and demonstrates why ETL and SQL relational databases are not a fit.
What is Complex Aggregation?
At the mention of ‘complex aggregation’, statistical algorithms and stochastic models may spring to mind. But there is no need to go that far to identify a calculated measure that has bugged many analytics projects. Think of cross-currency aggregation that calculates the grand total of sales in different currencies. Despite the apparent simplicity this involves more than just summing up numbers and actually requires intense computing.
Let’s explore why.
Consider a large ecommerce site with millions of visitors from all over the world, settling their transactions in their home currency. The Sales VP of this business would probably be interested in getting a consolidated number representing the total value of sales in one single currency, say in US Dollars (USD).
He will break that grand total across various dimensions, such as the countries where the transactions occurred, the products that were purchased or the time of the purchase. The aggregation function underlying those metrics is just a sum. Just that summing up amounts in different currencies requires applying foreign exchange rates to convert all numbers into a single currency before aggregating them.
How does this use case translate into technical terms?
Using ETL Tools and SQL Queries
You may use an ETL tool to enrich your source data with an additional field storing the sales amount in USD, the reference currency. The first limitation of this approach is that once you’ve settled for a target currency, for instance USD, you will find it cumbersome to change it to another currency, and even more painful to support several reference currencies.
Another limitation is soaring data volumes: Having the amounts both in the native currency and in USD actually means twice as much data, and when moving a billion transactions through the ETL, you will regret the waste of time and resources.
Last but not least, consider data freshness. As you convert all amounts at a specific point in time, they will quickly become obsolete due to the volatile nature of foreign exchanges rates. So ETL is not a suitable approach to effectively process large data sets that keep changing.
Using a relational database represents a slight improvement. Together with the table where sales are stored in their native currency, you create a small table containing the exchange rates. Then using a SQL query, you ‘join’ the tables and for each sale compute the amount in the reference currency. On top of the join, you close the loop with a ‘group by’ that aggregates the data at various levels to obtain the desired outcome.
This approach is much more dynamic because conversion is done at query time. It is then possible to update the FX rate table and execute the SQL query again to refresh the aggregates. The drawback is that this process is compute-intensive. You’ll have to apply the same exchange rate to each individual record, which would not be the most efficient if you had one billion records.
This will result in unacceptable query times for the sales VP who uses analytics to run his business. Only a best performing SQL relational databases running in-memory would be able to provide acceptable times. And this would still be ineffective because dynamic analytics would not be supported.
Let’s go one step further. In our previous example, we ignored the fact that all amounts in a given currency can be consolidated in one intermediate number to which the FX rate is then applied. Doing this intermediate summing is easy and more effective, because you just need to add numbers from the same column and there is no join involved. Then if the sales amounts are broken down in 10 currencies, you apply the conversion rate 10 times only instead of 1 billion times.
But while an SQL adept would be able to write an optimized query using the intermediate total, this is not the end goal. The grand total is only the starting point of the analysis. What if the sales VP wants to break down this total by country or by product or by seasonality? The query needs to be rewritten. It gets even worse in the case of filtering – for instance if the user wants to consider only on those countries that represent 80% of the revenue. SQL experts would then need to rewrite the SQL, test it, debug it and profile it until it delivers the requested filtering with decent performance levels.
Beyond the cost of frequent re-writes, this approach restricts the scope of business users’ analysis to pre-canned queries preventing dynamic analytics, which is typically an integral part of operational decision-making.
The bottom line is that an SQL relational database works well for static reporting but cannot deliver on the promise of real-time analysis.
Complex Aggregation in a Multidimensional Database
Calculated measures that run within a multidimensional database represent a much more simple and effective way to achieving dynamic aggregation.
With our ActivePivot multidimensional database we implemented a post-processor,written in Java that is evaluated at query time for each requested aggregate. Post-processors are extremely flexible. They can use pre-aggregated data, custom calculations, other post-processor results and/or any external resources to compute values. Using post-processors in a multidimensional database, a calculated measure will ‘magically’ apply to all the dimensions. There is nothing to rewrite or reconfigure if you add a new dimension or a new filter and the calculated measure is recomputed on the fly.
Going back to our example of cross-currency aggregation: the post processor will fetch sub-totals per currency in a split second, apply dynamic FX rates instantly, and complete the aggregation along the selected dimensions. The amount will be refreshed instantly if FX rates change. Last but not least, the sales VP can setup filters, drill in at any level of detail, add other dimension such as country, seasonality, or product and results will automatically be recomputed at lightning speeds.
This usage of dynamic calculation is the cornerstone of much more complex aggregations. In financial services for instance, dynamic aggregations are used to compute netting rules for credit risk calculations. In logistics, dynamic aggregations are used to compare planned versus actual SLAs at any time of the day.