Complex aggregation has become a common requirement for business users looking to analyze sophisticated metrics across multiple dimensions. There are numerous use cases for complex aggregation such as cross-currency aggregation, which was explored in our last post. Dynamic bucketing is another use case example.
This blog takes a deep look at the technical considerations for a successful implementation of dynamic bucketing.
What is Data Bucketing?
Data bucketing consists of breaking down a set of data according to a fact or an attribute. Consider the accounting book of a company: With more than three thousand days over the past ten years, a convenient way for the CFO to gauge sales volumes is to split all transactions by month, by quarter, by fiscal year etc. Put differently, data bucketing allows users to view large data volumes according to meaningful ranges that facilitate understanding.
Data bucketing may seem like a straightforward process. Indeed, if data is sorted according to an attribute with a limited number of possible values, such as grouping sales data by the five main geographical regions or by twenty product categories, there is no real technical challenge. However, complexity kicks in when users want to bucket data by selecting an attribute that has so many single values that it can’t be used as a discriminatory criterion.
…and what is Dynamic Data Bucketing?
Let’s take the case of the trading floor of an investment bank. Traders want to measure the risk of their portfolio depending on the maturity of assets: today, in three days, in a week, in one, ten or thirty years and so on. But with thousands of trades of various amounts every single minute, bucketing becomes a daunting task.
Or consider a web analytics system that analyzes customer behavior. Billions of clicks are recorded throughout a business day, with each click having a unique time stamp. Although the exact moment when clicks are registered is valuable information, it’s unusable as such. As a result, the business analyst wants to group all clicks by multiple time buckets, depending on the application he wants to use the data for: bucketing per minute in the case of dynamic pricing, bucketing by day in the case of historical trend analysis.
These examples have two points in common: First, the time stamp cannot be made sense of in its most granular form. Therefore business users need an analytical environment that allows them to split their data in a number of meaningful ranges regardless of the volume of data. Second, business users also want to be able to change the value of the intervals on the fly. The trader may want to look at the maturity of his portfolio in a time interval that starts at day + 1 and finishes in one year, but he may also want to view the maturity in a different time interval, for instance between day + 7 and the next 36 months.
In summary, whenever a business user wants to use a different interval, the analytical system needs to recompute new values. This means that the freedom of analysis expected by business users introduces an element of data bucketing complexity.
Implementing Dynamic Data Bucketing – the Alternatives
Using an ETL system might spring to mind as one possible method. However, ETL is incompatible with the concept of dynamic bucketing. Attributes are calculated for each unitary fact and are generally stored in a column. Although facts are enriched with additional dimensions that make up the buckets, the process is very static. Typically ETL won’t fit the bill if a user wants to use a bucket that has not been configured before the data has been loaded. Furthermore, if the source data comes in a large volume, the ETL will generate a load of calculated data, which will eventually impact performance.
It is impossible to anticipate every single possible bucket, let alone pre-calculate all buckets. Therefore you need a system that is capable of calculating buckets at query time and deliver updated results in a split second.
SQL databases are another option, which is more dynamic because bucketing is done at query time. An SQL database will generate as many queries as there are buckets, with each SQL query specifying the requested interval. So one bucket will generate one query, ten buckets will generate ten queries. An SQL database could be considered as an alternative in simple use cases – for instance when users keep looking at the same buckets. However, it is inefficient when users want to drill in data within a specific bucket or filter data. Let’s take the case of an operations manager at a courier company. In order to optimize capacity, the manager would like to view the total volume of delivered parcels split by the “weight” attribute and ask questions such as “How many delivered parcels weigh between and 100g and 500g? Between 100g and 200g? Between 200g and 300g?” and so on. Every single new query needs to be rewritten, tested and deployed again. This gets even worse in the case of filtering. Eventually, SQL databases cannot meet the needs of true multidimensional analysis.
Since traditional technologies fall short, you might want to look at in-memory, multidimensional databases, which provide a much more simple and effective way of achieving dynamic bucketing.
Buckets can be generated on the fly at query time using standard MDX queries. In this case, the bucket becomes a calculated member. Users can then apply other features that define multi-dimensional analysis to the bucket in order to do things such as filtering or calculating additional measures. User interfaces such as ActiveUI can automatically produce the MDX query.
A second method is using post processors. ActivePivot’s post-processors go one step beyond what you can achieve with calculated measures. In fact, they are probably the most efficient approach to dynamic bucketing.
First, post-processors are extremely fast and provide superior performance for bucketing billions of records to follow users’ train of thought. Performance is achieved in two manners. Post-processors are integrated in the core aggregation pipeline and therefore benefit from the optimization of ActivePivot’s multithreaded, in-memory engine. Then, buckets are dynamically generated at query time, so that every user can create his own buckets, depending on how he wants to look at the data.
Another advantage of post-processors is that they enable interacting with existing code and call other services in your organization, since they are implemented as bits of Java code. As a result, there is no limit to the business logic that can be injected in the post-processors. This means that dynamic and interactive bucketing is now within reach, for instance duration weighted sensitivity.
Interestingly enough, the most simple and straightforward use cases for analytics are often the most challenging ones. This is especially true of the analytics applications that are highly interactive, such as dynamic data bucketing. We have seen that neither ETL systems nor SQL databases can support the performance levels and dynamic, on-the-fly calculations required by dynamic data bucketing. In fact, interactive analytics applications require new technology approaches such as in-memory, multidimensional analytics technologies.