In the last post, I explained the difference between SMP and NUMA architectures as we enter the “many-core” era. I also asked the following question: “Is it reasonable to expect massive performance improvements when you run an existing application on new NUMA-enabled hardware?” The answer is yes. However, improved performance is not guaranteed and you must be prepared to rewrite the code of your application to get the best out of many-core hardware.

Quartet FS’ R&D team decided to re-architect, from the ground up, its core in-memory aggregation engine ActivePivot to make it fully “NUMA-aware”. Today, version 5 of ActivePivot is a living proof that scalability can be taken to a whole new level in the many-core era. In today’s post, I will share the results of a benchmark performed to substantiate this statement.

The NUMA trap

NUMA is a memory design that accommodates the expansion in the number of cores running on modern hardware servers. Software that was not written for NUMA in the first place will perform poorly on a machine equipped with hundreds of processors. The reason for this is explained in the following paragraphs.

NUMA is based on splitting memory and processors across several nodes. Under NUMA, a processor has access to local memory on the same node and to remote memories on the other nodes. Latency issues arise from cross-node data movements. Naturally, it takes more time for a processor to access memory on a remote node than to use local memory. If your Java application is not optimized for a NUMA architecture, in-memory data will be spread randomly across all memory nodes without taking the business and application logic into consideration. This is the default behavior of the Java Virtual Machine. It is therefore highly probable that processors will need to access data stored in remote NUMA nodes in order to complete a query. So even though multi-threading was a key issue when you wrote your code, you must still undertake significant redesign efforts to ensure that it can take full advantage of many-core architectures.

Let us consider the example of a cross-currency aggregation program that delivers the total P&L value of a given portfolio in US dollars. There are two processors and each is in charge of a distinct set of queries. Processor A sums the P&L value of all trades in EUROS. Processor B sums the P&L value of all trades in GBP. Processor B is also in charge of applying the USD conversion rate to all intermediate results before summing them and delivering the total P&L value in USD. For the sake of simplicity, Graph 1 illustrates only the first step of the process.

Symmetric Multi-Processing (SMP)

Under SMP, both processors access a single memory bank to collect the data they need to execute their own queries. Memory access is random, but it doesn’t affect performance because there is only one single, local memory node that both processors have access to.

How does a non “NUMA-aware” code behave?

NUMA without optimization

To address contention issues, larger systems increasingly rely on distributed architectures such as NUMA (Non-Uniform Memory Access). Let us see what happens if you execute that same program on NUMA-based hardware where trade data is split across two distinct memory banks.

Processor A must access memory on its local node to collect the P&L value of Trade 1 in EUROS. This will take about 150 nanoseconds. But since the P&L Value of Trade 2 in EUROS is located on a remote node, it will take 4 times more time for processor A to complete that part of the query. The same applies for processor B. In total, 1,500 nanoseconds will be required to complete both queries. The reason is simple: the Java Virtual Machine has distributed trade data randomly on all NUMA nodes, being oblivious to the business logic that pertains to the query.

Minimizing cross-node data movements

This simple example emphasizes the importance of minimizing the number of accesses to remote memory. The key strategy is to make sure that each processor can use the memory on its own node and needs to share almost nothing with other processors. To be able to do this, we entirely redesigned our in-memory aggregation software. We redesigned its code to ensure that it could take full advantage of distribution in a NUMA environment.

So how does it look with ActivePivot version 5 running on that same NUMA-enabled hardware?

NUMA aware

ActivePivot 5 is able to position memory exactly “at the right place”, depending on the business logic that has to be applied. We found a way to physically map each data partition right next to the processor that will process that partition. Because ActivePivot 5 has full control over memory allocation, it minimizes data movements across nodes and gets the best possible bandwidth in a NUMA environment. This translates into massive performance gains.

On Graph 3, all trades in a given currency are grouped on the memory bank that is the closest to the processor that executes the summing in that currency. It takes 2×150 nanoseconds for each processor to complete its respective query, bringing the total query time down to 600 nanoseconds! Because ActivePivot is able to bind the threads, it executes a query 250% faster than when memory is let free.

Beyond the explanatory value of the above example, NUMA really boils down to the scalability gains which can be expected from many-core hardware systems. If you run a business, you will probably want to have the following question answered: “If data volumes double, can performance be kept at the same level by simply doubling the hardware resources?”

Testing scaleout on a many-core machine

In response to this question, I would like to share with you the result of some benchmark that we ran on a NUMA server called the ‘Bullion’, a 160-core machine made of four commodity servers of one terabyte each. We selected four MDX queries to calculate the P&L value. The first query ran on the whole dataset, while the other three ran on a limited scope which was defined by the number of trading desks (either one or two) and the number of historical dates. When using the Bullion at its full capacity, ActivePivot aggregated 600 million trades.

Gustafson chart

The first benchmark we ran was Gustafson’s test to assess if performance stays linear when data volumes and hardware resources double. We started with a dataset that fits in one terabyte and we used only one node out of the four. Then we doubled the dataset and used two nodes. Then we doubled the dataset again and used all four nodes. Query response times are almost flat, as shown on graph 4.

This proves that 160 cores can work simultaneously without suffering from any contention. When volumes double, ActivePivot maintains the service level by adding more resources without changing a single line of code.

The second benchmark we ran is the Amdahl test: What happens if I use more computing power to process the same volume of data?

Amdahl chart

We observed that when running a query on the full dataset, ActivePivot runs three times as fast on a machine that is four times bigger. As we grow the hardware, ActivePivot is able to take advantage of the increased number of threads and to maximize memory throughput. These results demonstrate the efficiency of ActivePivot on NUMA: Version 5 of ActivePivot is able to scale out on some of the biggest commodity hardware available today.

Conclusions to take away: You must be prepared to re-design the code for your applications for NUMA, even if it was written to work on multi-threaded hardware in the first place. If this is done correctly, speed-up and scale-out can both be achieved. This is easier said than done and you should expect to invest significant effort, while breaking away from old paradigms. However, the effort is definitely worth it: Hardware equipped with modern NUMA interconnect mechanisms, such as ‘Bullion’, are already available on the market today. So don’t miss the opportunity to deliver a quantum leap in performance for your mission-critical applications. Note: These tests were executed on a Bullion machine equipped with 4 NUMA nodes and 16 sockets, each one having 1 E7-4850 Xeon (10 physical cores, 2.0 GHz) and 256 GiB of RAM (1067 MHz DDR3).