When your business is analyzing big data with the goal of providing answers in split seconds, you find yourself trying to squeeze every bit of speed out of your solution. Among other things, this includes finding the optimal processor architecture. This is why we've spent quite a lot of effort studying two memory architecture alternatives, NUMA (non-uniform memory access) and SMP (symmetric multiprocessing), to see which one could provide us with the best results.
NUMA is a computer memory design used in multiprocessing, which is based on splitting memory and processors across several nodes. NUMA is an interesting alternative to traditional SMP architectures, particularly given the growing number of processors in modern servers. Of course, this assumes you succeed in avoiding a few traps.
This post compares SMP and NUMA architectures, and provides some guidelines to take full advantage of NUMA as we enter the many-core era.
The beauty of resource sharing in the multicore era
Many multiprocessor systems today use Symmetric Multi-Processing (SMP) architectures. In fact, SMP was one of the earliest styles of multiprocessor machine architecture. SMP is a multiprocessor hardware and software architecture in which two or more identical processors have direct access to the same memory over a single shared system bus. All processors can fetch data from any page of memory at the same speed.
Generally speaking, SMP systems involve a pool of homogeneous processors running independently. Each processor executes different programs and works on different data, while sharing common resources (memory, I/O devices, the interrupt system and so on). Because different programs can run on different CPUs simultaneously, processing speed is dramatically improved.
From multicore to many-core: the limits of SMP
However, when a large number of jobs are processed in an SMP environment, hardware efficiency often drops. This is because modern CPUs operate considerably faster than the main memory they use. Schedulers are written so that processor utilization reaches its maximum potential, but even the best software cannot overcome the fact that access to memory remains serialized in SMP systems. This causes performance to deteriorate as additional processors are added to the system. The main reason is the limited bandwidth of the bus: when processors race for memory, they contend for the shared system bus, which results in slowdowns.
SMP was typically used for building computers with up to 8 processors. But as we transition from the ‘multicore’ era into the ‘many-core’ era, where machines will contain hundreds of processors, the SMP architecture is being challenged. While it is always possible to build SMP architectures that scale further, hardware specialists are looking for more affordable, simpler and more elegant alternatives.
To address contention issues, larger systems increasingly rely on distributed architectures such as NUMA (Non-Uniform Memory Access).
NUMA is a computer memory design which allocates a separate memory bank to each processor. By splitting memory and CPUs across different nodes, NUMA avoids the performance hit that occurs when several processors attempt to address the same memory, as happens in SMP-based systems. Under NUMA, a processor has access to local memory on its own node and to remote memory on the other nodes. Accessing local memory is faster than accessing remote memory, hence the non-uniformity. An overall speed increase is the most obvious benefit of NUMA: in theory, NUMA can improve performance over a single shared memory by a factor of roughly the number of separate memory banks.
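To make the "factor of roughly the number of memory banks" claim concrete, here is a deliberately simple back-of-the-envelope model in Java. It is only a sketch: the class name, the method names and all bandwidth figures are hypothetical, and it assumes that every NUMA access is local and that load is spread evenly across banks.

```java
public class BandwidthModel {

    // Aggregate bandwidth (GB/s) when all processors share a single memory bus:
    // total demand is capped by what the one bus can deliver.
    public static double sharedBus(int processors, double demandPerProcessor, double busBandwidth) {
        return Math.min(processors * demandPerProcessor, busBandwidth);
    }

    // Aggregate bandwidth (GB/s) when memory is split into independent banks.
    // Assumption: every access is local and processors are spread evenly over
    // the banks, so the cap is the sum of the per-bank bandwidths.
    public static double numaAllLocal(int processors, double demandPerProcessor,
                                      double bankBandwidth, int banks) {
        return Math.min(processors * demandPerProcessor, banks * bankBandwidth);
    }

    public static void main(String[] args) {
        double demand = 5.0;  // hypothetical GB/s requested by each processor
        double bank = 20.0;   // hypothetical GB/s delivered by one bus or bank
        int banks = 4;        // hypothetical number of NUMA nodes
        for (int n : new int[] {2, 4, 8, 16, 32}) {
            System.out.printf("%2d processors: shared bus %.0f GB/s, NUMA %.0f GB/s%n",
                    n, sharedBus(n, demand, bank), numaAllLocal(n, demand, bank, banks));
        }
    }
}
```

In this model the shared bus saturates as soon as total demand exceeds what one bus can deliver, while the NUMA figure keeps growing until all banks are saturated. The ratio between the two ceilings is the number of banks, which is the theoretical best case and only holds as long as accesses stay local.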
Is NUMA an opportunity or a curse?
If NUMA is such a great opportunity, why do engineers so often experience sheer disappointment when they begin working with it? Indeed, finding out that performance is cut in half when an application runs on a machine that is twice as expensive is a valid cause for frustration.
The reason is that in NUMA, memory access time – and therefore speed – depends on the memory location relative to the processor that is using it. A processor that reads data from its home node gets the smallest latency and the maximum bandwidth. But when a processor reads data from a remote node, it goes through a special network between the nodes. The latency is higher, and the bandwidth reduced. In fact, the farther a processor is from memory, the slower the performance.
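The effect of remote accesses can be illustrated with a weighted average. The Java sketch below is an assumption-laden model, not a measurement: the class and method names are made up for this example, and the 100 ns and 200 ns latencies are placeholder values chosen only to show the trend.

```java
public class NumaLatencyModel {

    // Average memory latency when a fraction of accesses goes to a remote node.
    // remoteFraction must be between 0.0 (all local) and 1.0 (all remote).
    public static double averageLatencyNs(double localNs, double remoteNs, double remoteFraction) {
        if (remoteFraction < 0.0 || remoteFraction > 1.0) {
            throw new IllegalArgumentException("remoteFraction must be in [0, 1]");
        }
        return (1.0 - remoteFraction) * localNs + remoteFraction * remoteNs;
    }

    public static void main(String[] args) {
        double localNs = 100.0;   // hypothetical local access latency
        double remoteNs = 200.0;  // hypothetical remote access latency
        for (double fraction : new double[] {0.0, 0.25, 0.5, 0.75, 1.0}) {
            System.out.printf("remote fraction %.2f: average latency %.1f ns%n",
                    fraction, averageLatencyNs(localNs, remoteNs, fraction));
        }
    }
}
```

If memory ends up spread uniformly across k nodes without regard to which processor uses it, roughly (k - 1) / k of all accesses are remote, so on a machine with many nodes the average drifts close to the remote latency.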
To benefit from NUMA's full potential, software running on a NUMA architecture should be designed to minimize remote accesses, so that processors use the memory on their own node and share almost nothing with other processors. When an application running on NUMA-based hardware is able to place memory exactly where it is needed, the performance gains are massive.
Yet achieving this speed-up requires a significant application redesign, particularly for Java-based applications. Software that was not written for NUMA in the first place will simply perform poorly. Preparing an application's code for use in a NUMA environment is the only way to ensure that it takes full advantage of distribution.
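As an illustration of the "share almost nothing" idea, the following Java sketch splits the work into partitions and lets each worker thread allocate, fill and scan its own data, so that only one partial result per worker crosses thread boundaries. This is a generic pattern and not the ActivePivot implementation; the class and method names are hypothetical. Standard Java has no API to pin a thread to a NUMA node, so the sketch assumes that placement is handled outside the code, for example by launching the JVM under `numactl` or enabling the HotSpot `-XX:+UseNUMA` option, and that the operating system places pages on the node of the thread that first touches them. Neither assumption is guaranteed, and the garbage collector may still move objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionedSum {

    // Sums the values 0 .. partitions * elementsPerPartition - 1.
    // Each worker allocates, fills and reads only its own array, so the data
    // is first touched by the same thread that later scans it.
    public static long sum(int partitions, int elementsPerPartition) {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (int p = 0; p < partitions; p++) {
                final long offset = (long) p * elementsPerPartition;
                Callable<Long> task = () -> {
                    long[] local = new long[elementsPerPartition]; // allocated by the worker
                    for (int i = 0; i < local.length; i++) {
                        local[i] = offset + i;
                    }
                    long partial = 0;
                    for (long value : local) {
                        partial += value;
                    }
                    return partial; // only this single value is shared
                };
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while summing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sum(4, 1_000_000));
    }
}
```

The design choice worth noting is that the partition is created inside the task rather than by the submitting thread. Allocating one large array up front and handing slices to the workers would compute the same result, but the whole array would then be first touched by a single thread.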
At Quartet FS, we've been researching NUMA for more than two years. We learned how NUMA hardware and operating systems behave and how to get the best out of them. Our in-memory Java-based aggregation technology, ActivePivot, has been redesigned to leverage this architecture. By having full control over memory allocation, ActivePivot is now able to minimize cross-node data movements and get the best possible bandwidth in a NUMA environment. In the next post, I will share the results of some benchmarks that we ran on a 160-core NUMA server. Stay tuned!