Historical data analysis is typically enabled using data duplication technologies. But is this method still valid today, when users need to analyze historical data that moves fast and changes rapidly throughout the day? What we do know is that in ActivePivot, we practically had to reinvent our core database to support customers who wanted to travel back in time and analyze large volumes of dynamic data.
How do you perform historical analysis when the data has changed?
Historical analysis is at the heart of Business Intelligence (BI). Whether business users want to analyze risk, revenues, prices, sales, web traffic, or deliveries, they rely on historical data to monitor performance trends and understand how today’s figures compare with those of the past.
Recent use cases highlight the need for continuous insights into business indicators that are based on fast-changing data. Such insights make it possible to quickly detect deviations from plan – for example, when average delivery times exceed those defined in the SLA; when a trader’s portfolio loss is above the maximum authorized limit; or when the price of a star product has been higher than a competitor’s for more than three hours.
The requirements for operational analytics are driving a new generation of data aggregation platforms that combine analytical and transactional processing, or as labeled by Gartner, “Hybrid Transaction/Analytical Processing” (HTAP).
Nevertheless, real-time analytical capabilities do not mean that historical analysis is no longer needed. Typically, by the time a user starts handling an event, the data has already changed completely. Consider the risk manager of a bank: a limit breach happens at 10:00am, but the risk manager is only able to look at it at 11:00am. To understand the root cause of the breach, he needs to see the data exactly as it was at the time of the breach. Yet between 10:00 and 11:00, traders have booked new transactions and the market has moved. The ability to perform historical analysis therefore poses a complex technical challenge.
The technical challenge behind historical analysis
Enabling historical analysis is typically done by duplicating the data once a day and running analytics on the duplicated data set. Every day, the data is copied and stamped with its associated date. While this approach is well suited to day+1 reporting, it does not address operational analytics requirements with fast-changing data: because indicators are continuously recomputed as the data keeps changing, copying the whole data set every time ‘something happens’ would result in unrealistic data volumes.
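A back-of-the-envelope calculation makes the volume problem concrete. The figures below (10 million rows, 1,000 intraday changes) are illustrative assumptions, not numbers from the article or any real deployment:

```python
# Illustrative, assumed figures -- not from any real deployment.
rows = 10_000_000        # rows in the data set
row_bytes = 200          # average size of a row in bytes
changes_per_day = 1_000  # intraday events that modify the data

# Copying the data once a day is cheap; copying it on every change
# multiplies the storage cost by the number of intraday changes.
daily_snapshot_gb = rows * row_bytes / 1e9
copy_on_every_change_gb = daily_snapshot_gb * changes_per_day

print(f"one daily snapshot:  {daily_snapshot_gb:.0f} GB")
print(f"one copy per change: {copy_on_every_change_gb:.0f} GB")
```

Under these assumptions, a single daily snapshot is about 2 GB, while a full copy on every change would consume roughly 2 TB per day – three orders of magnitude more.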
Database management systems already include concurrency control mechanisms that make it possible to restore a previous state of the database. However, this requires so much manual administration work that it is unrealistic for operational analytics use cases.
So is there a way to go back in time right in the database itself and make it suitable for operational analytics?
Reinventing the ‘MV’ in Multiversion Concurrency Control
In ActivePivot, MVCC was initially implemented to decouple queries from transactions and support complex queries executing on data that changes in real time. To avoid data duplication, the MVCC implementation was based on ‘delta’ structures: a new version of the data is characterized by the difference – the delta – from its previous version.
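To illustrate the idea, here is a minimal sketch in Python. The names (`DeltaVersion`, `commit`, `get`) are hypothetical and do not reflect ActivePivot’s actual API; the point is only the shape of delta-based versioning: each version stores just the entries that changed relative to its parent, and a read walks back through the chain until it finds the key.

```python
class DeltaVersion:
    """Minimal delta-based versioning: each version stores only what changed."""

    def __init__(self, delta, parent=None):
        self.delta = dict(delta)  # entries changed in this version
        self.parent = parent      # previous version, or None for the base

    def commit(self, delta):
        # A new version differs from this one only by `delta`;
        # nothing is copied, so older versions remain readable for free.
        return DeltaVersion(delta, parent=self)

    def get(self, key):
        # Walk back through the chain of deltas until the key is found.
        version = self
        while version is not None:
            if key in version.delta:
                return version.delta[key]
            version = version.parent
        raise KeyError(key)
```

For example, after `base = DeltaVersion({"AAPL": 150.0}); v1 = base.commit({"AAPL": 152.0})`, both versions are live: `v1.get("AAPL")` returns 152.0 while `base.get("AAPL")` still returns 150.0, without duplicating any untouched entries.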
This design proved successful because it was memory-efficient and provided great performance on the most recent version of the data. But soon enough, our clients realized that several versions of the data could exist at the same time within ActivePivot. One of these clients, the biggest commodity exchange in the world, asked if we could keep all versions of the data for the current day. This meant keeping one thousand versions to ensure that risk analysts could go back in time and perform effective root cause analysis on outliers that happened earlier during the day.
This made us realize that we had to overcome a shortcoming of the delta structure: when you read data in the past, you have to traverse several “hops”, which hurts performance. To make a long story short, this resulted in a one-year R&D project in which we practically rewrote our MVCC engine from scratch.
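The hop problem can be quantified with a small sketch (again illustrative Python, not ActivePivot code): in a pure delta chain, reading a key that was last written a thousand versions ago forces the read to traverse a thousand deltas.

```python
def value_at(deltas, index, key):
    """Value of `key` as of version `index` in a chain of deltas,
    together with the number of hops the read had to make."""
    hops = 0
    for delta in reversed(deltas[: index + 1]):
        if key in delta:
            return delta[key], hops
        hops += 1
    raise KeyError(key)

# A key written once at the start of the day, followed by 1,000
# unrelated updates: reading it back costs 1,000 hops.
chain = [{"limit": 5_000_000}] + [{"tick": i} for i in range(1_000)]
value, hops = value_at(chain, len(chain) - 1, "limit")
```

Bounding that traversal – for instance by periodically folding runs of deltas into full snapshots – is one generic way out; the article does not detail which technique the rewritten engine actually uses.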
‘Time travel’ use cases
Using our new MVCC implementation, ActivePivot now allows you to go back in time and analyze data snapshots at any instant in time, from 3 seconds ago back to the start of the day. Acting as a data time machine, ActivePivot instantly recomputes all indicators exactly as they were at a specific point in the past – even with large data volumes and high-frequency data streams – allowing you to pinpoint the cause of an event.
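One way to picture the “as of” lookup (a hedged sketch; `version_as_of` and the timestamp list are hypothetical, not ActivePivot’s API): keep every version of the day stamped with its commit time, and binary-search for the latest version committed at or before the requested instant.

```python
import bisect

def version_as_of(commit_times, query_time):
    """Index of the latest version committed at or before `query_time`.
    `commit_times` must be sorted ascending (versions commit in order)."""
    i = bisect.bisect_right(commit_times, query_time) - 1
    if i < 0:
        raise ValueError(f"no version existed at {query_time}")
    return i

# Versions committed at 08:00, 08:30 and 09:00 (times in minutes past midnight).
commits = [480, 510, 540]
```

With this sketch, a query “as of 08:45” (`version_as_of(commits, 525)`) resolves to the 08:30 version, index 1 – the snapshot that was current at the requested instant.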
This paves the way for new use cases. Risk analysts can ask the question “What did the market look like as of 8:45am this morning?”, and ActivePivot recalculates all measures using the underlying data – the trades, the positions, the prices – exactly as they were at 8:45am. Supply chain managers can now understand the combination of events or factors that led to the unexpected delivery delay at 10:00am this morning, retracing how the delay propagated throughout the day.
Do you wish to see the data time machine in action rather than in words? Feel free to schedule an ActivePivot demo, or view a 20-minute presentation from our User Group conference discussing this topic.
1. Gartner research note: ‘Hybrid Transaction/Analytical Processing Will Foster Opportunities for Dramatic Business Innovation’, Massimo Pezzini, Donald Feinberg, Nigel Rayner, Roxane Edjlali, 28 January 2014.