Thursday, 21 May 2020

PostgreSQL on ARM


In my last blog, I wrote that applications that have been running on x86 might need to undergo some adaptation if they are to be run on a different architecture such as ARM. Let's see what it means exactly.

Recently I have been playing around with PostgreSQL RDBMS using an ARM64 machine. A few months back, I even didn't know whether it can be compiled on ARM, being oblivious of the fact that we already have a regular build farm member for ARM64 for quite a while. And now even the PostgreSQL apt repository has started making PostgreSQL packages available for ARM64 architecture. But the real confidence on the reliability of PostgreSQL-on-ARM came after I tested it with different kinds of scenarios.

I started with read-only pgbench tests and compared the results on the x86_64 and the ARM64 VMs available to me. The aim was not to compare any specific CPU implementation. The idea was to find out scenarios where PostgreSQL on ARM does not perform in one scenario as good as it performs in other scenarios, when compared to x86.


Test Configuration

ARM64 VM:
   Ubuntu 18.04.3; 8 CPUs; CPU frequency: 2.6 GHz; available RAM : 11GB
x86_64 VM:
   Ubuntu 18.04.3; 8 CPUs; CPU frequency: 3.0 GHz; available RAM : 11GB

Following was common for all tests :

PostgreSQL parameters changed : shared_buffers = 8GB
pgbench scale factor : 30
pgbench command :
for num in 2 4 6 8 10 12 14
do
   pgbench [-S]  -c $num -j $num -M prepared -T 40
done
What it means is : pgbench is run with increasing number of parallel clients, starting from 2 to 14.


Select-only workload

pgbench -S option is used for read-only workload.


Between 2 and 4 threads, the x86 performance is 30% more than ARM, and the difference rises more and more. Between 4 and 6, the curves flatten a bit, and between 6 and 8, the curves suddenly become steep. After 8, it was expected to flatten out or dip, because the machines had 8 CPUs. But there is more to it. The pgbench clients were running on the same machines where servers were installed. And with fully utilized CPUs, the clients took around 20% of the CPUs. So they start to interfere from 6 threads onward. In spite of that, there is a steep rise between 6 and 8, for both ARM and x86. This is not yet understood by me, but possibly it has something to do with the Linux scheduler, and the interaction between the pgbench clients and the servers. Note that, the curve shape is mostly similar on both x86 and ARM. So this behaviour is not specific to architectures. One difference in the curves, though, is : the ARM curve has a bit bigger dip from 8 threads onward. Also, betweeen 6 and 8, the sudden jump in transactions is not that steep for ARM compared to x86. So the end result in this scenario is : As the CPUs become more and more busy, PostgreSQL on ARM lags behind x86 more and more. Let's see what happens if we remove the interference created by pgbench clients.


select exec_query_in_loop(n)

So, to get rid of the noise occurring because of both client and server on the same machines, I arranged for testing exactly what I intended to test: query performance. For this, pgbench clients can run on different machines, but that might create a different noise: network latency. So instead, I wrote a PostgreSQL C language user-defined function that keeps on executing in a loop the same exact SQL query that is run by this pgbench test. Execute this function using the pgbench custom script. Now, pgbench clients would be mostly idle. Also, this won't take into account the commit/rollback time, because most of the time will be spent inside the C function.

pgbench custom script : select exec_query_in_loop(n);
where n is the number of times the pgbench query will be executed on the server in a loop.
The loop query is the query that gets normally executed with pgbench -S option:
SELECT abalance FROM pgbench_accounts WHERE aid = $1
Check details in exec_query_in_loop()




Now, you see a very different curve. For both curves, upto 8 threads, transactions rate is linearly proportional to number of threads. After 8, as expected, the transactions rate doesn't rise. And it has not dipped, even for ARM. PostgreSQL is consistently around 35% slower on x86 compared to ARM. This sounds not that bad when we consider that the ARM CPU frequency is 2.6 GHz whereas x86 is 3.0 Gz. Note that the transaction rate is single digit, because the function exec_query_in_loop(n) is executed with n=100000.

This experiment also shows that the previous results using built-in pgbench script have to do with pgbench client interference. And that, the dip in curve for ARM for contended threads is not caused by the contention in the server. Note that, the transactions rates are calculated at client side. So even when a query is ready for the results, there may be some delay in the client requesting the results , calculating the timestamp, etc, especially in high contention scenarios.


select exec_query_in_loop(n) - PLpgSQL function

Before using the user-defined C function, I had earlier used a PL/pgSQL function to do the same work. There, I stumbled across a different kind of performance behaviour.





Here, PostgreSQL on ARM is around 65% slower than on x86, regardless of number of threads. Comparing with the previous results that used a C function, it is clear that PL/pgSQL execution is remarkably slower on ARM, for some reason. I checked the perf report, but more or less the same hotspot functions are seen in both ARM and x86. But for some reason, anything executed inside PL/pgSQL function becomes much slower on ARM than on x86.

I am yet to check the cache misses to see if those are more on ARM. As of this writing, what I did was this (some PostgreSQL-internals here) : exec_stmt_foreach_a() calls exec_stmt(). I cloned exec_stmt() to exec_stmt_clone(), and made exec_stmt_foreach_a() call exec_stmt_clone() instead. This sped up the overall execution, but it sped up 20% more for ARM. Why just this change caused this behaviour is kind of a mystery to me as of now. May be it has to do with the location of a function in the program; not sure.


Updates

The default pgbench option runs the tpcb-like built-in script, which has some updates on multiple tables.



Here, the transaction rate is only around 1-10% percent less on ARM compared to x86. This is probably because major portion of the time goes in waiting for locks, and in disk writes during commits. And the disks I used are non-SSD disks. But overall it looks like, updates on PostgreSQL are working good on ARM.


Next thing, I am going to test with aggregate queries, partitions, high number of CPUs (32/64/128), larger RAM and higher scale factor, to relatively see how PostgreSQL scales on the two platforms with large resources.


Conclusion

We saw that PostgreSQL RDBMS works quite robustly on ARM64. While it is tricky to compare the performance on two different platforms, we could still identify which areas it is not doing good by comparing patterns of behaviour in different scenarios in the two platforms.

ARM: Points to be noted


The story of ARM began in 1993 with a joint venture of Apple with ARM (then Acorn RISC Machines) to launch the "Apple Newton" handheld PC. And the story continues today with news that Apple is going to switch their MACs to ARM processors.  What has not changed in the story is ARM's reputation as a power-efficient processor. This is the primary reason why it is so popular in smarthphones, and why it has made its way into smart cars, drones and other internet-of-things devices where it is crucial to preserve battery life and minimize heat generation. Today even data centers can run on ARM. Due to such widespread market disruption happening, I thought about putting some specific points which I think are good-to-know for users and software developers who have just begun using the ARM ecosystem ...

The reason why ARM power consumption is less has to do with the inherent nature of RISC architecture on which ARM is based. RISC instructions are so simple that each of them requires only one clock cycle to execute; so they require less transistors, and hence less power is required and less heat is generated.

Ok, but then why ARM processors started making their way into data centers? After all, mobile phones and data centers don't have anything in common. Or do they?

Well, both consume power, and both need to perform well for a given price. Even though data centers are huge as compared to the size of a mobile phone, their CPU usage is also huge. So power efficiency is equally important. And so is the price for a given performance.


Divide and conquer

So, just replace the existing expensive processors with more number of cheaper ARM processors, so that the total CPU power will be equal to the existing power ? Yes, this does work. Suppose, there are 4 CPUs serving 16 parallel processes, it's better for them to be instead served by 8 or 16 lower performing CPUs.  Overall throughput will likely be higher.

But what if there is a single long database query which needs high CPU power ? Even here, the database query can make use of multiple CPUs to run a parallelized query.  Here we see that even the software needs to adapt to this paradigm shift: divide the task into number of parallel tasks wherever possible. We need to understand the fact that more than the power of a single CPU, what counts is the total power of all the CPUs.

Another thing is that, the worloads are not always high. For instance, cloud service workloads are always mixed, frequently with numerous small tasks, where again a server with large number of low power CPUs fits well.


big.LITTLE

In the ARM's big.LITTLE architecture, there can be two or more cores of different performance capacity in the same SoC. And if the workload processed by one of them changes, the other one can take over that workload on the fly if it is more suitable for the changed workload. This way unnecessary power usage and heat generation is prevented because the low-power processor type gets chosen. There has been support for doing such scheduling particularly for big.LITTLE in the linux kernel.


ARM's licensing model

As many of you might know, ARM does not manufacture chips; it designs them. And it's clients buy its license to manufacture chips based on ARM's design. Now, there are two kinds of licenses.

One is the core license.  When a company buys the core license, it has to manufacture the complete CPU core using ARM's in-house core design without modifying it. The ARM's family of core designs that it licenses, are named Cortex-A**. E.g. in Qualcomm's Snapdragon 855 chipset, all CPU cores are based on Cortex-A series; it means they used the ARM core license.

The other is the ARM architecture license. When a company buys this license and not the core license, it has to design it's own core, but the core design has to be compatible with the ARM instruction set. Such cores are often called custom cores, because they have their own micro-architecture that is not designed by ARM. This provides flexibility to the big companies to build cores as per their own needs. Companies like Qualcomm, Huawei, Apple and Samsung have built such custom cores.

The beauty of this licensing model is : the ready-made core design is available to just anybody (of course a license has to be bought). And hence there are a number of vendors who all have manufactured compatibile chips. This drives innovation and competition.


Applications

Applications for mobile devices were already written from scratch on ARM processers. But what about the software running on servers ? Well, Linux kernel has support for ARM, so OSes like Ubuntu, CentOS and Debian already have officially supported ARM images. Furthermore, if you are running on, say Ubuntu, almost all the usual x86 packages that are present in the Ubuntu repository are already there for ARM as well, at least for ARMv8. I was able to install the PostgreSQL database package, and have been running pgbench with high contention, and it runs just fine. (Probably in later blogs, I will elaborate on PostgreSQL further) Also, the compilers like gcc/g++ are already tuned for ARM architecture, so most of the hardware-specific compiler optimizations are transparently done for ARM.

But when it comes to running software meant for data servers, a lot of adaptation might be required to have a reasonable performance. For instance, applications have to be aware of the implications of the ARM's weak memory model, especially for code synchronizatoin. Secondly, they should leverage in-built ARM capabilities like NEON (which is the ARM's brand name for SIMD) to parallelize same operation on multiple data; and so on.

A lot of research and analysis is going on to optimize sofware running in the ARM ecosystem as a whole. But we are already seeing a gradual transition and adaptation to this ecosystem.

Backtraces in PostgreSQL

PostgreSQL 13 has introduced a simple but extremely useful capability to log a stack trace into the server logs when an error is reported. L...