tag:blogger.com,1999:blog-25708706494495308932024-02-07T20:05:18.775-08:00Database WorldAmit Khandekarhttp://www.blogger.com/profile/11439293447705622764noreply@blogger.comBlogger6125tag:blogger.com,1999:blog-2570870649449530893.post-59588791099798023032021-07-07T12:52:00.001-07:002021-07-08T04:46:18.891-07:00A quick sanity testing of pgpool-II on ARM64<p>pgpool-II is a well-known tool to pool PostgreSQL connections and load-balance workloads. In this blog, we will verify whether pgpool works well on the ARM64 architecture.<br /><br /><b>ARM64 packages</b><br /><br />On Ubuntu, the pgpool2 ARM64 Debian package is made available by the <a href="https://qa.debian.org/developer.php?login=team%2Bpostgresql%40tracker.debian.org" target="_blank">Debian PostgreSQL Maintainers</a> :<br /><br />$ dpkg -s pgpool2<br />Maintainer: Debian PostgreSQL Maintainers <team+postgresql@tracker.debian.org><br />Architecture: arm64<br />Version: 4.1.4-3.pgdg18.04+2<br />Replaces: pgpool<br />Depends: libpgpool0 (= 4.1.4-3.pgdg18.04+2), lsb-base (>= 3.0-3), postgresql-common (>= 26), ucf, libc6 (>= 2.17), libmemcached11, libpam0g (>= 0.99.7.1), libpq5, libssl1.1 (>= 1.1.0)<br /><br />On CentOS, the ARM64 yum packages are not available. But you can always download the source from the <a href="https://www.pgpool.net/mediawiki/index.php/Source_code_repository" target="_blank">repository</a> or using <a href="https://git.postgresql.org/gitweb/?p=pgpool1.git" target="_blank">git</a>, and then install from the source.<br /><b> </b></p><p><b>Installation from source</b><br /><br />Building pgpool requires PostgreSQL to be installed on the same machine. The build uses the 'pg_config' command to locate the required PostgreSQL header files, library files, and executables. So ensure that your PATH environment variable includes the PostgreSQL bin directory, so that './configure' picks up the right pg_config :<br /><br />export PATH=<PostgreSQL_rootdir>/bin:$PATH<br />$ ./configure --prefix=<pgpool_installation_location><br />pgpool will be installed at the location specified by --prefix.<br /><br />Alternatively, you can explicitly specify the required PostgreSQL directories :<br />$ PGDIR=<PostgreSQL_rootdir><br />$ ./configure --prefix=<pgpool_installation_location> --with-pgsql-includedir=$PGDIR/include --with-pgsql-libdir=$PGDIR/lib --with-pgsql-bindir=$PGDIR/bin<br /><br />If you downloaded the source from the git repository, you need to check out the latest stable branch :<br />$ git branch --track V4_2_STABLE remotes/origin/V4_2_STABLE<br />$ git checkout V4_2_STABLE<br />$ git branch<br />* V4_2_STABLE<br /> master<br /><br />Now install pgpool2:<br />$ make<br />$ make install<br /><br />Create a pcp.conf file out of the supplied sample file:<br />export PGPOOLDIR=<pgpool_top_dir><br />cd $PGPOOLDIR<br />cp ./etc/pcp.conf.sample ./etc/pcp.conf<br /><br />pgpool supports different clustering modes. We will use the most popular and recommended clustering mode : streaming replication mode. Use the relevant sample configuration file to set up our configuration:<br />cp etc/pgpool.conf.sample-stream etc/pgpool.conf<br />This file already has backend_clustering_mode set to 'streaming_replication', so we do not have to change it.<br /><br />We need to set up PostgreSQL streaming replication before starting up pgpool. 
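If you do not have such a setup handy, here is a minimal sketch of creating one (hedged to the paths and ports used in the rest of this blog; pg_basebackup's -R option writes the standby's replication settings, and the default local pg_hba.conf rules are assumed to allow replication connections) :<br /><br />$ initdb -D /home/amit/replication/pg/data/primary<br />$ pg_ctl -D /home/amit/replication/pg/data/primary -o "-p 26501" -l primary.log -w start<br />$ pg_basebackup -h localhost -p 26501 -D /home/amit/replication/pg/data/standby -R -X stream<br />$ echo "port = 26502" >> /home/amit/replication/pg/data/standby/postgresql.conf<br />$ pg_ctl -D /home/amit/replication/pg/data/standby -l standby.log -w start<br /><br />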
The following assumes that streaming replication is already set up, with the data directories for the primary and standby servers located at /home/amit/replication/pg/data/primary and /home/amit/replication/pg/data/standby, respectively. They are running on ports 26501 and 26502, respectively.<br /><br /><b>Configuration file</b><br /><br />In the etc/pgpool.conf file, I made these modifications :<br /><br />pid_file_name = 'pgpool.pid'<br />listen_addresses = '*'<br />sr_check_user = 'amit'<br />backend_hostname0 = 'localhost'<br />backend_port0 = 26501<br />backend_weight0 = 1<br />backend_data_directory0 = '/home/amit/replication/pg/data/primary'<br />backend_hostname1 = 'localhost'<br />backend_port1 = 26502<br />backend_weight1 = 1<br />backend_data_directory1 = '/home/amit/replication/pg/data/standby'<br />logdir = '/home/amit/ins/pgpool/log'<br /><br /></p><ul style="text-align: left;"><li>'pid_file_name' specifies the full path to a file to store the pgpool2 process id. The default value is /var/run/pgpool/pgpool.pid. This directory path was not accessible, so I had to change it to a relative path name, i.e. relative to the location of the pgpool.conf file.</li><li>'sr_check_user' specifies the PostgreSQL user name to perform the streaming replication check. The default value 'nobody' didn't work for obvious reasons.</li><li>'backend_hostname' and 'backend_port' specify the hostname and port on which the primary and standby are running. backend_data_directory points to their data directories. In our case, there are only two instances, so we used the suffixes 0 and 1. We can add more instances with incrementing numeric suffixes.</li><li>I had configured streaming replication with both the primary and standby on the same machine where pgpool is installed, so I used 'localhost' as the backend_hostname for both of them.</li><li>backend_weight0 and backend_weight1 together specify the ratio in which the load is balanced among the PostgreSQL instances. E.g. 1:1 means equal balancing, 1:3 means the standby should bear around 3/4th of the total load, and so on.</li><li>Setting 'logdir' to an existing directory is important, because the node status is stored in that path. 
I observed that not setting this parameter results in only one node getting used, which means load balancing does not happen.</li></ul><p><br /><br /><b>Starting up pgpool</b><br /><br />Now let's bring the pgpool binary location into the PATH before we bring up pgpool :<br />export PGPOOLDIR=<pgpool_top_dir><br />export LD_LIBRARY_PATH=$PGPOOLDIR/lib:$LD_LIBRARY_PATH<br />export PATH=$PGPOOLDIR/bin:$PATH<br /><br />We have both the pcp.conf and pgpool.conf files on default paths ($PGPOOLDIR/etc), with default names, so we don't have to explicitly specify them using the -f or -F options of the pgpool command:<br /><br />$ pgpool -n<br />2021-07-08 01:44:12: main pid 14435: LOG: health_check_stats_shared_memory_size: requested size: 12288<br />2021-07-08 01:44:12: main pid 14435: LOG: memory cache initialized <br />2021-07-08 01:44:12: main pid 14435: DETAIL: memcache blocks :64 <br />2021-07-08 01:44:12: main pid 14435: LOG: allocating (136571704) bytes of shared memory segment<br />2021-07-08 01:44:12: main pid 14435: LOG: allocating shared memory segment of size: 136571704<br />2021-07-08 01:44:12: main pid 14435: LOG: health_check_stats_shared_memory_size: requested size: 12288<br />2021-07-08 01:44:12: main pid 14435: LOG: health_check_stats_shared_memory_size: requested size: 12288<br />2021-07-08 01:44:12: main pid 14435: LOG: memory cache initialized <br />2021-07-08 01:44:12: main pid 14435: DETAIL: memcache blocks :64 <br />2021-07-08 01:44:12: main pid 14435: LOG: pool_discard_oid_maps: discarded memqcache oid maps<br />2021-07-08 01:44:12: main pid 14435: LOG: Setting up socket for 0.0.0.0:9999 <br />2021-07-08 01:44:12: main pid 14435: LOG: Setting up socket for :::9999 <br />2021-07-08 01:44:12: main pid 14435: LOG: find_primary_node_repeatedly: waiting for finding a primary node<br />2021-07-08 01:44:12: main pid 14435: LOG: find_primary_node: primary node is 0 <br />2021-07-08 01:44:12: main pid 14435: LOG: find_primary_node: standby node is 1 <br />2021-07-08 01:44:12: health_check pid 14474: LOG: process started <br />2021-07-08 01:44:12: health_check pid 14475: LOG: process started <br />2021-07-08 01:44:12: sr_check_worker pid 14473: LOG: process started <br />2021-07-08 01:44:12: pcp_main pid 14472: LOG: PCP process: 14472 started <br />2021-07-08 01:44:12: main pid 14435: LOG: pgpool-II successfully started. version 4.2.3 (chichiriboshi)<br />2021-07-08 01:44:12: main pid 14435: LOG: node status[0]: 1 <br />2021-07-08 01:44:12: main pid 14435: LOG: node status[1]: 2 <br /><br /><br />Ok, so pgpool is now up. It is listening on port 9999, which is the value that came from the sample pgpool.conf file. Now any PostgreSQL client can connect with "-p 9999".<br /><br />I can see 32 pgpool processes waiting for connection requests :<br />amit 9884 9881 0 00:02 pts/3 00:00:00 pgpool: wait for connection request<br />amit 9885 9881 0 00:02 pts/3 00:00:00 pgpool: wait for connection request<br />amit 9886 9881 0 00:02 pts/3 00:00:00 pgpool: wait for connection request<br />amit 9887 9881 0 00:02 pts/3 00:00:00 pgpool: wait for connection request<br />............<br />............<br />These are pre-forked pgpool2 processes. 
The pgpool parameter 'num_init_children' decides how many such processes are spawned as part of the connection pool.<br /><br />Now let's see if pgpool identifies the primary and standby :<br /><br />$ psql -p 9999 -c "show pool_nodes" <br /> node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change <br />---------+-----------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------<br /> 0 | localhost | 26501 | up | 0.500000 | primary | 0 | true | 0 | | | 2021-07-08 01:41:34<br /> 1 | localhost | 26502 | up | 0.500000 | standby | 0 | false | 0 | | | 2021-07-08 01:41:34<br />(2 rows)<br /></p><p><b>Load Balancing</b><br /><br />Now let's run pgbench with a read-only workload, using pgpool:<br /><br />pg:s2:pgpool$ pgbench -c 30 -j 30 -p 9999 -T 30 -n -S<br />pgbench (PostgreSQL) 14.0<br />transaction type: <builtin: select only><br />scaling factor: 30<br />query mode: simple<br />number of clients: 30<br />number of threads: 30<br />duration: 30 s<br />number of transactions actually processed: 834074<br />latency average = 1.078 ms<br />initial connection time = 20.744 ms<br />tps = 27817.538209 (without initial connection time)<br /><br />$ psql -p 9999 -c "show pool_nodes" <br /> node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change <br />---------+-----------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------<br /> 0 | localhost | 26501 | up | 0.500000 | primary | 362792 | true | 0 | | | 2021-07-08 01:52:23<br /> 1 | localhost | 26502 | up | 0.500000 | standby | 471282 | false | 0 | | | 2021-07-08 01:52:23<br />(2 rows)<br /><br />We can see that the select_cnt column shows a roughly equal number of select statements for the two nodes, which is expected, given the 1:1 load-balance weight. 
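<br /><br />A note before changing the weights : pgpool reads the backend_weight values from pgpool.conf, so after editing them, pgpool has to re-read its configuration before new sessions are balanced with the new ratio. Assuming pgpool.conf is at its default path, one way (shown here as a sketch; restarting pgpool works too) is :<br /><br />$ pgpool reload<br /><br />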
So let's change the ratio to 1:5 :<br />backend_weight0 = 1<br />backend_weight1 = 5<br /><br />pg:s2:pgpool$ psql -p 9999 -c "show pool_nodes" <br /> node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change <br />---------+-----------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------<br /> 0 | localhost | 26501 | up | 0.166667 | primary | 165955 | false | 0 | | | 2021-07-08 02:04:35<br /> 1 | localhost | 26502 | up | 0.833333 | standby | 665867 | true | 0 | | | 2021-07-08 02:04:35<br /><br />The select_cnt for the standby is indeed much higher than for the primary.<br /><br /><br /><b>Read-write workload</b><br /><br />Let's run the pgbench tpcb-like workload with pgpool :<br /><br />pg:s2:pgpool$ pgbench -c 30 -j 30 -p 9999 -T 40 -n <br />pgbench (PostgreSQL) 14.0<br />transaction type: <builtin: TPC-B (sort of)><br />scaling factor: 30<br />query mode: simple<br />number of clients: 30<br />number of threads: 30<br />duration: 40 s<br />number of transactions actually processed: 80917<br />latency average = 14.866 ms<br />initial connection time = 24.412 ms<br />tps = 2018.063384 (without initial connection time)<br /><br />$ psql -p 9999 -c "show pool_nodes"<br /> node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change <br />---------+-----------+-------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------<br /> 0 | localhost | 26501 | up | 0.500000 | primary | 80883 | false | 0 | | | 2021-07-08 02:36:43<br /> 1 | localhost | 26502 | up | 0.500000 | standby | 0 | true | 0 | | | 2021-07-08 02:36:43<br /><br />You can see that the standby did not get any select queries, even though there is one select statement in the tpcb-like script. The reason is: the select statements are preceded by DML statements in the same transaction. In such cases, pgpool decides not to load-balance the select statements within the transaction, so that the select statements can see the updates made by the ongoing transaction.<br /><br /><br /><b>Performance impact of load balancing</b><br /><br />To observe any improvement in performance numbers due to load balancing, I am going to run some long-running queries, so as to reduce the noise of connection pooling and pgpool statement parsing. 
With my setup of an 8 CPU VM, pgbench otherwise gives slower results when pgpool is used, due to this connection noise.<br /><br />So my pgbench custom script my.sql looks like this :<br />select avg(aid), avg(bid) , avg(abalance) from pgbench_accounts;<br />select avg(aid), avg(bid) , avg(abalance) from pgbench_accounts;<br />select avg(aid), avg(bid) , avg(abalance) from pgbench_accounts;<br />select avg(aid), avg(bid) , avg(abalance) from pgbench_accounts;<br />...........<br />...........<br /><br />To simulate the primary and standby as though they are on different machines, let's run each of them on a separate set of 4 CPUs:<br />numactl --physcpubind=0-3 pg_ctl -D $MASTERDATA -l $MASTERDATA/postgres.log -w start<br />numactl --physcpubind=4-7 pg_ctl -D $STANDBYDATA -l $STANDBYDATA/postgres.log -w start<br /><br />First, let's run pgbench directly against the primary server :<br />$ pgbench -c 8 -j 8 -p 26501 -T 60 -n -f /tmp/my.sql <br />pgbench (PostgreSQL) 14.0<br />transaction type: /tmp/my.sql<br />scaling factor: 1<br />query mode: simple<br />number of clients: 8<br />number of threads: 8<br />duration: 60 s<br />number of transactions actually processed: 18<br />latency average = 34973.239 ms<br />initial connection time = 4.879 ms<br />tps = 0.228746 (without initial connection time)<br /><br /><br />Now, let's run pgbench with pgpool :<br />$ pgbench -c 8 -j 8 -p 9999 -T 60 -n -f /tmp/my.sql <br />pgbench (PostgreSQL) 14.0<br />transaction type: /tmp/my.sql<br />scaling factor: 1<br />query mode: simple<br />number of clients: 8<br />number of threads: 8<br />duration: 60 s<br />number of transactions actually processed: 29<br />latency average = 21008.485 ms<br />initial connection time = 7.001 ms<br />tps = 0.380799 (without initial connection time)<br /><br />You can see that the number of transactions processed with pgpool was 29, whereas when pgpool was bypassed, only 18 transactions were processed. This difference is because pgpool distributes the select queries over to the primary and standby.<br /><br /><br /><b>Conclusion</b><br /><br />Debian packages are available for pgpool on ARM64. For platforms where an ARM64 pgpool package is not available, installation from source goes smoothly.<br /><br />We have seen that pgpool load balancing works sanely on ARM64.<br /><br />There are many other parameters and features that could be tested, but it would take another blog to cover them. In a future blog, I hope to verify the automatic failover of pgpool.<br /><br /></p>Amit Khandekarhttp://www.blogger.com/profile/11439293447705622764noreply@blogger.com0tag:blogger.com,1999:blog-2570870649449530893.post-77729824196525045482021-04-13T05:27:00.001-07:002021-04-13T08:58:42.236-07:00A quick sanity testing of PostgreSQL parallel query on Arm64<div style="text-align: left;"><span style="font-family: inherit;">Query parallelism has been supported in PostgreSQL for quite a while now. People in the PG community call it "Parallel query", but by now it is not limited to just SELECT queries. Index builds leverage multiple cores; and even utilities like VACUUM now make use of parallelism. Furthermore, the community is working on parallelizing COPY and INSERT.</span></div><div><span style="font-family: inherit;"> </span></div><div><span style="font-family: inherit;">I was interested to do a kind of "sanity" check of this capability specifically on the <b>ARM64</b> platform. Let's see how it goes. At the same time, we will try to understand a little bit of how to interpret the parallelism part of the plan output. 
Subqueries and partitions are not covered in this blog; I will probably cover them in another blog.</span></div><div><span style="font-family: inherit;"> </span></div><div><span style="font-family: inherit;">For running the queries I generated a scale-5 TPC-H benchmark schema with the help of scripts taken from https://github.com/tvondra/pg_tpch.git. My machine is an 8 CPU VM with 15GB memory and Ubuntu </span>18.04<span style="font-family: inherit;">, running on a "Kunpeng 920" 2.6 GHz host. The PostgreSQL build was from the git master branch, so you can treat it as somewhere between PostgreSQL 13 and 14. All the tests were run with max_parallel_workers_per_gather = 4. The tables were pre-warmed, so I reduced seq_page_cost and random_page_cost to as low as 0.1.</span></div><div><span style="font-family: inherit;"> </span></div><div><span style="font-family: inherit;">The JIT-related part of the EXPLAIN output is omitted from the plans to keep the focus on the main query plan. Also, estimated costs are omitted in order to make the plan output compact.</span></div><div><span style="font-family: inherit;"> </span></div><div><b style="font-family: inherit;"><br /></b></div><div><b style="font-family: inherit;"><br /></b></div><div><b style="font-family: inherit;">Parallel sequential scan</b></div><div><span style="font-family: inherit;"> </span></div><div><span style="font-family: inherit;">This is the simplest one, and the one with which query parallelism got introduced in PostgreSQL 9.6.</span></div><div><span style="font-family: inherit;"> </span></div><div><span style="font-family: inherit;">Just a plain "select * from lineitem" won't give us a parallel scan, because all the tuples need to be transferred from the workers to the leader backend. A parallel scan is beneficial only when this tuple transfer cost is small enough. So let's reduce the number of rows selected :</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><div style="text-align: left;"><div style="text-align: left;"><div>tpch=# explain (analyze, costs off)</div><div>tpch-# select l_orderkey from lineitem where l_shipmode = 'AIR' and l_shipinstruct = 'TAKE BACK RETURN';</div><div><br /></div><div><div> QUERY PLAN</div><div>--------------------------------------------------------------------------------------------------</div><div> Gather (actual time=6.264..1776.956 rows=1070891 loops=1)</div><div> Workers Planned: 4</div><div> Workers Launched: 4</div><div> -> Parallel Seq Scan on lineitem (actual time=6.959..1640.647 rows=214178 loops=5)</div><div> Filter: ((l_shipmode = 'AIR'::bpchar) AND (l_shipinstruct = 'TAKE BACK RETURN'::bpchar))</div><div> Rows Removed by Filter: 5785781</div><div> Planning Time: 0.205 ms</div><div> Execution Time: 1823.987 ms</div></div><div><br /></div><div><div>So the parallel sequential scan took 1824 ms to execute. 
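A quick aside on reading the per-worker numbers in this plan : with 4 workers plus the leader backend, the Parallel Seq Scan node runs in loops=5 processes, and rows=214178 is the average row count per process; multiplying them back together, 214178 * 5 = 1070890, which (up to rounding) matches the rows=1070891 reported at the Gather node. 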
Let's compare this with a sequential scan :</div><div><br /></div><div>tpch=# set max_parallel_workers_per_gather TO 0; -- Disable parallelism</div><div><br /></div><div> QUERY PLAN</div><div>--------------------------------------------------------------------------------------------</div><div> Seq Scan on lineitem (actual time=117.795..5077.520 rows=1070891 loops=1)</div><div> Filter: ((l_shipmode = 'AIR'::bpchar) AND (l_shipinstruct = 'TAKE BACK RETURN'::bpchar))</div><div> Rows Removed by Filter: 28928904</div><div> Planning Time: 0.101 ms</div><div> Execution Time: 5123.774 ms</div><div><br /></div><div><br /></div><div>So the parallel seqscan was around 2.8 times faster.</div><div><br /></div><div><div>Just some background before we move on to other queries ... Parallelism is achieved by distributing the table blocks to the workers, and the parallel workers then do their job of reading and processing the tuples from those blocks. But how is it ensured that no two workers scan the same block ? After all, they are running in parallel, so they have to make sure that each block is scanned by only one particular worker; otherwise duplicate rows would be returned. To make this happen, there is coordination between the workers. They are all *aware* that they are running in parallel, so they keep a shared "next block to read" pointer, which each worker updates once it chooses its own next block. This type of parallel plan node is called "parallel-aware"; it has the prefix "Parallel" before the plan name in the EXPLAIN output. A plan node sitting on top of such a parallel-aware node might itself be running in a parallel worker, but it may not be aware of that, even though it is actually processing only a partial set of rows in parallel, since the underlying parallel seq scan is processing its own set of table blocks. Such a plan node can be called a "parallel-oblivious" plan, for the sake of naming it. We will talk about this more when we discuss parallel joins and aggregates.</div><div> </div><div>Another rule of thumb is : A Gather node is the umbrella parallel node, under which all the nodes in the subtree are run in parallel by workers. A Gather node's job is to gather the tuples returned by each worker, and pass them on to the upper node. All the nodes above Gather run in the usual parent backend. 
There cannot be nested Gather nodes.</div></div><div><br /></div><div><br /></div><div><b>Index Scan</b></div><div><div><br /></div><div>The following query didn't produce a parallel index scan : </div></div><div><br /></div><div><div>tpch=# explain (analyze, costs off)</div><div>select l_partkey from lineitem where l_partkey < 100000;</div><div> QUERY PLAN</div><div>------------------------------------------------------------------------------------------------------------</div><div> Index Only Scan using idx_lineitem_part_supp on lineitem (actual time=0.078..895.358 rows=2999506 loops=1)</div><div> Index Cond: (l_partkey < 100000)</div><div> Heap Fetches: 0</div><div> Planning Time: 0.129 ms</div><div> Execution Time: 1012.693 ms</div></div></div><div><br /></div><div><div>So let's try reducing parallel_tuple_cost, for the sake of reproducing a parallel index scan :</div><div>tpch=# set parallel_tuple_cost TO 0.0001; </div></div><div><div>tpch=# explain (analyze, costs off) select l_partkey from lineitem where l_partkey < 100000;</div><div> QUERY PLAN </div><div>--------------------------------------------------------------------------------------------------------------------------</div><div> Gather (actual time=0.390..387.086 rows=2999506 loops=1)</div><div> Workers Planned: 4</div><div> Workers Launched: 4</div><div> -> Parallel Index Only Scan using idx_lineitem_part_supp on lineitem (actual time=0.098..262.780 rows=599901 loops=5)</div><div> Index Cond: (l_partkey < 100000)</div><div> Heap Fetches: 0</div><div> Planning Time: 0.802 ms</div><div> Execution Time: 509.306 ms</div><div><br /></div></div><div><br /></div><div>Notes:</div><div><br /></div><div><div>parallel_tuple_cost is the cost of transferring tuples from the workers to the leader backend. Note that I used a contrived value of .0001 just for the sake of reproducing a parallel index scan. Although setting it gives us an index scan with a faster execution time, it is not recommended to change these costing parameters without obtaining conclusive statistics on your system.</div></div><div><br /></div><div><div>An index-only scan is a special kind of Index Scan, in that the index already has the data required by the select query, so a separate heap scan is avoided; only the index is scanned. A plain index scan or a bitmap heap scan also supports parallelism; we will see more of these in aggregate or table join examples where they are more commonly seen.</div><div><br /></div><div>A parallel index scan does not produce ordered results, unlike a non-parallel index scan. Multiple workers read index blocks in parallel. So although each worker returns its own tuples sorted, taken together the result set is not sorted, due to the parallel index block reads.</div><div><br /></div><div>Parallel index scan is only supported for btree indexes.</div></div><div><br /></div><div><br /></div><div><br /></div><div><div><b>Parallel Aggregate</b></div><div><br /></div><div><br /></div><div>An aggregate expression in a query typically returns drastically fewer rows compared to the number of table rows, inherently, since the values are aggregates computed over a row set. 
So there is very little worker-to-leader tuple transfer cost involved, and an aggregate query almost always benefits from parallelism.</div><div><br /></div><div>tpch=# -- Check out sequential aggregate plan</div><div>tpch=# set max_parallel_workers_per_gather TO 0;</div><div>tpch=# explain (analyze, costs off) select max(l_tax) from lineitem;</div><div> QUERY PLAN</div><div>--------------------------------------------------------------------------------</div><div> Aggregate (actual time=11230.951..11230.951 rows=1 loops=1)</div><div> -> Seq Scan on lineitem (actual time=0.009..2767.802 rows=29999795 loops=1)</div><div> Planning Time: 0.105 ms</div><div> Execution Time: 11231.739 ms</div><div><br /></div><div><br /></div><div>tpch=# -- Check out parallel aggregate plan</div><div>tpch=# set max_parallel_workers_per_gather TO 4;</div><div>tpch=# explain (analyze, costs off) select max(l_tax) from lineitem;</div><div> QUERY PLAN</div><div>---------------------------------------------------------------------------------------------------</div><div> Finalize Aggregate (actual time=2150.383..2190.898 rows=1 loops=1)</div><div> -> Gather (actual time=2150.241..2190.883 rows=5 loops=1)</div><div> Workers Planned: 4</div><div> Workers Launched: 4</div><div> -> Partial Aggregate (actual time=2137.664..2137.665 rows=1 loops=5)</div><div> -> Parallel Seq Scan on lineitem (actual time=0.016..563.268 rows=5999959 loops=5)</div><div> Planning Time: 0.896 ms</div><div> Execution Time: 2202.304 ms</div><div><br /></div><div>So it's 5 times faster; pretty neat.</div><div><br /></div><div>Above, we can see that the usual Aggregate plan node is divided into two kinds of Aggregate plan nodes. The one below the Gather node is the Partial Aggregate node, which, as its name implies, aggregates only the values returned by its own worker. It means that it has not run the finalize function yet. 
That is the task of the Finalize Aggregate node, which combines the partial aggregates returned by all the workers through the Gather node.</div><div><br /></div><div><br /></div><div><br /></div><div><b><span style="font-size: medium;">Joins</span></b></div><div><b><span style="font-size: medium;"><br /></span></b></div><div><br /></div><div>We will analyze the three different joins using this query :</div><div><br /></div><div>select avg(l_discount) from orders, lineitem</div><div>where</div><div> l_orderkey = o_orderkey</div><div> and o_orderdate < date '1995-03-09'</div><div> and l_shipdate > date '1995-03-09';</div><div><br /></div><div><br /></div><div><b>Merge Join</b></div><div><br /></div><div> QUERY PLAN</div><div>------------------------------------------------------------------------------------------------------------------------------------------</div><div> Finalize Aggregate (actual time=5030.051..5241.390 rows=1 loops=1)</div><div> -> Gather (actual time=5029.991..5241.370 rows=5 loops=1)</div><div> Workers Planned: 4</div><div> Workers Launched: 4</div><div> -> Partial Aggregate (actual time=5015.939..5015.940 rows=1 loops=5)</div><div> -> Merge Join (actual time=199.287..4987.159 rows=149782 loops=5)</div><div> Merge Cond: (lineitem.l_orderkey = orders.o_orderkey)</div><div> -> Parallel Index Scan using idx_lineitem_orderkey on lineitem (actual time=198.962..2095.747 rows=3248732 loops=5)</div><div> Filter: (l_shipdate > '1995-03-09'::date)</div><div> Rows Removed by Filter: 2751227</div><div> -> Index Scan using orders_pkey on orders (actual time=0.057..2343.402 rows=3625756 loops=5)</div><div> Filter: (o_orderdate < '1995-03-09'::date)</div><div> Rows Removed by Filter: 3874054</div><div> Planning Time: 0.290 ms</div><div> Execution Time: 5243.194 ms</div><div><br /></div><div><br /></div><div>As you can see above, the Merge Join is under a Gather node. That means the Merge Join is being executed in a parallel worker. Is the merge join being executed using some coordination with other workers ? Or, in other words, is the Merge Join parallel-aware ? No. As you can see, the Merge Join does not have a "Parallel" prefix. A Merge Join needs sorted input on both the outer side and the inner side, hence we have index scans on both sides. Now, when a Merge Join is executed in a worker, a subset of the outer-side table is joined with the full inner side. This is possible because the outer side is scanned using a parallel Index Scan, while the inner side is a normal Index Scan, which means each worker does a full Index Scan of the inner side. Effectively, the Merge Join's data is divided, thanks to the underlying Parallel Index Scan dividing the data, and the Merge Join does not even know that it is being run in parallel ! The caveat is that the inner side has to be redundantly scanned fully by each worker, followed by a sort if required. In our case the sort operation was not necessary because of the index.</div><div><br /></div><div>There is scope for improvement to make the Merge Join parallel-aware, by appropriately partitioning the sorted data of both tables and doing a Merge Join of the pairs of partitioned data sets in parallel. 
But that discussion would need a separate blog.</div><div><br /></div><div><br /></div><div><b>Parallel-aware Hash Join</b></div><div><br /></div><div> QUERY PLAN </div><div>-------------------------------------------------------------------------------------------------------------</div><div> Finalize Aggregate (actual time=3054.189..3099.784 rows=1 loops=1)</div><div> -> Gather (actual time=3022.810..3099.755 rows=5 loops=1)</div><div> Workers Planned: 4</div><div> Workers Launched: 4</div><div> -> Partial Aggregate (actual time=3007.931..3007.934 rows=1 loops=5)</div><div> -> Parallel Hash Join (actual time=643.552..2980.305 rows=149782 loops=5)</div><div> Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)</div><div> -> Parallel Seq Scan on lineitem (actual time=0.030..1685.258 rows=3248732 loops=5)</div><div> Filter: (l_shipdate > '1995-03-09'::date)</div><div> Rows Removed by Filter: 2751227</div><div> -> Parallel Hash (actual time=639.508..639.508 rows=725169 loops=5)</div><div> Buckets: 4194304 Batches: 1 Memory Usage: 174688kB</div><div> -> Parallel Seq Scan on orders (actual time=14.083..384.196 rows=725169 loops=5)</div><div> Filter: (o_orderdate < '1995-03-09'::date)</div><div> Rows Removed by Filter: 774831</div><div> Planning Time: 0.300 ms</div><div> Execution Time: 3101.937 ms</div><div><br /></div><div>The inner side of the Hash Join is a "Parallel Hash" node. This plan builds a shared hash table by dividing the work among parallel coordinating workers. As with a sequential Hash Join, the outer side waits until the hash table is built. Once it is built, the same workers then start scanning the outer table and doing the join using the shared hash table. The outer scan is essentially a partial scan because each worker does it in parallel. 
So in our case, it's a parallel sequential scan.</div><div><br /></div><div><br /></div><div><b>Parallel-oblivious Hash Join</b></div><div><br /></div><div><div>If the inner side is just a Hash node rather than a "Parallel Hash" node, then it means a separate full hash table will be built by each of the workers rather than having a shared hash table, which obviously would be more expensive than the parallel hash due to the absence of division of the hash-building work:</div></div><div><br /></div><div> QUERY PLAN</div><div>-----------------------------------------------------------------------------------------------------------------------------------</div><div> Finalize Aggregate (actual time=5908.032..5971.214 rows=1 loops=1)</div><div> -> Gather (actual time=5852.417..5971.167 rows=5 loops=1)</div><div> Workers Planned: 4</div><div> Workers Launched: 4</div><div> -> Partial Aggregate (actual time=5850.930..5850.933 rows=1 loops=5)</div><div> -> Hash Join (actual time=2309.307..5826.753 rows=149782 loops=5)</div><div> Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)</div><div> -> Parallel Seq Scan on lineitem (actual time=12.631..1712.443 rows=3248732 loops=5)</div><div> Filter: (l_shipdate > '1995-03-09'::date)</div><div> Rows Removed by Filter: 2751227</div><div> -> Hash (actual time=2290.063..2290.065 rows=3625845 loops=5)</div><div> Buckets: 2097152 Batches: 4 Memory Usage: 48222kB</div><div> -> Bitmap Heap Scan on orders (actual time=502.264..1512.424 rows=3625845 loops=5)</div><div> Recheck Cond: (o_orderdate < '1995-03-09'::date)</div><div> Heap Blocks: exact=138113</div><div> -> Bitmap Index Scan on idx_orders_orderdate (actual time=451.552..451.552 rows=3625845 loops=5)</div><div> Index Cond: (o_orderdate < '1995-03-09'::date)</div><div> Planning Time: 0.291 ms</div><div> Execution Time: 5977.966 ms</div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><b>Nested Loop Join</b></div><div><br /></div><div> QUERY PLAN</div><div>-----------------------------------------------------------------------------------------------------------------------</div><div> Finalize Aggregate (actual time=7211.122..7258.289 rows=1 loops=1)</div><div> -> Gather (actual time=7193.150..7258.259 rows=5 loops=1)</div><div> Workers Planned: 4</div><div> Workers Launched: 4</div><div> -> Partial Aggregate (actual time=7129.209..7129.210 rows=1 loops=5)</div><div> -> Nested Loop (actual time=13.924..7100.095 rows=149782 loops=5)</div><div> -> Parallel Seq Scan on lineitem (actual time=13.621..1919.712 rows=3248732 loops=5)</div><div> Filter: (l_shipdate > '1995-03-09'::date)</div><div> Rows Removed by Filter: 2751227</div><div> -> Result Cache (actual time=0.001..0.001 rows=0 loops=16243662)</div><div> Cache Key: lineitem.l_orderkey</div><div> Hits: 2450631 Misses: 844081 Evictions: 0 Overflows: 0 Memory Usage: 61379kB</div><div> Worker 0: Hits: 2443189 Misses: 841050 Evictions: 0 Overflows: 0 Memory Usage: 61158kB</div><div> Worker 1: Hits: 2350093 Misses: 808929 Evictions: 0 Overflows: 0 Memory Usage: 58824kB</div><div> Worker 2: Hits: 2424018 Misses: 833681 Evictions: 0 Overflows: 0 Memory Usage: 60615kB</div><div> Worker 3: Hits: 2417114 Misses: 830876 Evictions: 0 Overflows: 0 Memory Usage: 60407kB</div><div> -> Index Scan using orders_pkey on orders (actual time=0.004..0.004 rows=0 loops=4158617)</div><div> Index Cond: (o_orderkey = lineitem.l_orderkey)</div><div> Filter: (o_orderdate < '1995-03-09'::date)</div><div> Rows Removed by Filter: 1</div><div> Planning Time: 0.294 
ms</div><div> Execution Time: 7268.857 ms</div><div><br /></div><div><br /></div><div>By nature, a nested loop join has to have the whole inner side scanned for each outer tuple. So we can divide the outer scan among the workers, and have a complete inner table scan done by each of the workers, which gives us a parallel-oblivious Nested Loop Join. There is no need to make it parallel-aware.</div><div><br /></div><div><br /></div><div><br /></div><div><b>Sequential Join</b> </div><div><br /></div><div>If we disable parallelism, we can see a sequential hash join. Note that all of the above parallel joins are reasonably faster than the below sequential join ...</div><div><br /></div><div>tpch=# set max_parallel_workers_per_gather TO 0;</div><div><br /></div><div> QUERY PLAN</div><div>-----------------------------------------------------------------------------------------------------------------------</div><div> Aggregate (actual time=15714.776..15714.779 rows=1 loops=1)</div><div> -> Hash Join (actual time=5134.219..15603.861 rows=748912 loops=1)</div><div> Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)</div><div> -> Bitmap Heap Scan on lineitem (actual time=2837.938..7162.214 rows=16243662 loops=1)</div><div> Recheck Cond: (l_shipdate > '1995-03-09'::date)</div><div> Heap Blocks: exact=607593</div><div> -> Bitmap Index Scan on idx_lineitem_shipdate (actual time=2556.845..2556.845 rows=16243662 loops=1)</div><div> Index Cond: (l_shipdate > '1995-03-09'::date)</div><div> -> Hash (actual time=2290.201..2290.202 rows=3625845 loops=1)</div><div> Buckets: 2097152 Batches: 4 Memory Usage: 48222kB</div><div> -> Bitmap Heap Scan on orders (actual time=563.536..1548.176 rows=3625845 loops=1)</div><div> Recheck Cond: (o_orderdate < '1995-03-09'::date)</div><div> Heap Blocks: exact=138113</div><div> -> Bitmap Index Scan on idx_orders_orderdate (actual time=333.284..333.285 rows=3625845 loops=1)</div><div> Index Cond: (o_orderdate < '1995-03-09'::date)</div><div> Planning Time: 0.267 ms</div><div> Execution Time: 15727.275 ms</div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><b>Gather Merge</b></div><div><br /></div><div><br /></div><div>tpch=# explain (analyze, costs off)</div><div>select l_orderkey from lineitem where l_suppkey > 10000 order by l_suppkey ;</div><div> QUERY PLAN</div><div>----------------------------------------------------------------------------------------------</div><div> Gather Merge (actual time=3351.705..8310.367 rows=23998124 loops=1)</div><div> Workers Planned: 4</div><div> Workers Launched: 4</div><div> -> Sort (actual time=3181.446..3896.115 rows=4799625 loops=5)</div><div> Sort Key: l_suppkey</div><div> Sort Method: external merge Disk: 136216kB</div><div> Worker 0: Sort Method: external merge Disk: 120208kB</div><div> Worker 1: Sort Method: external merge Disk: 116392kB</div><div> Worker 2: Sort Method: external merge Disk: 123520kB</div><div> Worker 3: Sort Method: external merge Disk: 114264kB</div><div> -> Parallel Seq Scan on lineitem (actual time=55.688..915.160 rows=4799625 loops=5)</div><div> Filter: (l_suppkey > 10000)</div><div> Rows Removed by Filter: 1200334</div><div> Planning Time: 0.102 ms</div><div> Execution Time: 9654.078 ms</div><div><br /></div><div><br /></div><div>Gather Merge is a modified version of the Gather plan. It basically parallelizes the sort. 
So Gather Merge gets sorted output from the workers, and then it merges the sorted streams and returns the sorted output.</div><div><br /></div><div><br /></div><div>A sequential Sort took almost twice as long :</div><div><br /></div><div> QUERY PLAN </div><div>---------------------------------------------------------------------------------</div><div> Sort (actual time=14399.200..18068.514 rows=23998124 loops=1)</div><div> Sort Key: l_suppkey</div><div> Sort Method: external merge Disk: 610560kB</div><div> -> Seq Scan on lineitem (actual time=16.346..4320.823 rows=23998124 loops=1)</div><div> Filter: (l_suppkey > 10000)</div><div> Rows Removed by Filter: 6001671</div><div> Planning Time: 0.086 ms</div><div> Execution Time: 20015.980 ms</div><div><br /></div><div><br /></div><div><br /></div><div>There are a lot of other scenarios where parallelism can be observed, but I will probably take them up in a later blog ....</div></div><div><br /></div><div><br /></div><div><br /></div></div></div></div>Amit Khandekarhttp://www.blogger.com/profile/11439293447705622764noreply@blogger.com0tag:blogger.com,1999:blog-2570870649449530893.post-710075991812290992020-08-30T07:58:00.002-07:002020-08-30T08:02:43.506-07:00Need for external compression methods in PostgreSQL<p> Every modern database system has some way to compress its data at some level. The obvious reason for this feature is to reduce the size of its database, especially in today's world where the data is growing exponentially. The less obvious reason is to improve query performance; the idea is: smaller data size means fewer data pages to scan, which means less disk I/O and faster data access. So, in any case, data de-compression should be fast enough so as not to hamper the query performance, if not improve it.<br /><br />Compression is offered at different levels : page compression, row compression, column compression, etc. Columnar databases have the advantage of a very high compression ratio for their columns, because of the presence of repetitive patterns in the contiguous data of a column. Another case is when, in a row-oriented database, the column values are so large that it makes sense to compress individual values of the column. Such values can even be kept separately if they do not fit in a single page. And the row has pointers to the out-of-line compressed data. In PostgreSQL, such a technique is called TOAST (The Oversized-Attribute Storage Technique), where, for columns that can contain variable-length data, the data is transparently compressed and stored in the same row, or else if it is still too large, it is stored in smaller chunks as rows in a separate table called a toast table, where these chunks themselves may or may not be compressed.<br /><br />Compression is offered for different purposes. It need not be restricted to only data compression. E.g. in a replication system, the transfer of redo logs from the master to the slave can become a huge network bottleneck, so many RDBMSs offer to compress the redo logs. <br /><br />And then come the compression algorithms that the RDBMS uses or gives options to choose from. This applies especially to data compression. 
Since the data is the user's data, a specific pattern in the user data might suit a particular compression algorithm, while a different pattern might be suitable for another compression algorithm. Moreover, this implies that it would be far more beneficial if the RDBMS gives an option to choose a specific compression algorithm for a specific column or a specific user-defined type, out of a list of well-known standard compression libraries such as zlib, lz4, zstd, snappy, gzip, etc. Or, the library algorithm may very well be a completely customized one. <br /><br />Secondly, there have been a lot of advancements to optimize compression algorithms for specific platforms, and to provide hardware accelerators for Compression, Encryption and SIMD that are closely coupled to CPU cores, which can then be leveraged by compression or encryption algorithms. One such example is the <a href="https://github.com/kunpengcompute/KAEzip" target="_blank">Kunpeng Zlib Acceleration Engine</a>, which offers a hardware-enabled infrastructure for compression on a "Kunpeng 920" ARM64 processor. I haven't had a chance to test this capability yet, but it does sound promising.<br /><br />Furthermore, compression/encryption algorithms inherently do repetitive tasks over the data, which is a natural fit for leveraging SIMD vectorization. There have been independent projects going on for both ARM64 and Intel to do such platform-specific enhancements in well-known libraries like zlib, lz4, etc. Check out this <a href="https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/neon-programmers-guide-for-armv8-a/neon-intrinsics-chromium-case-study/adler-32">NEON Intrinsics case study</a> that optimizes zlib's adler-32 algorithm using NEON intrinsics.<br /><br />All this directly points to an urgent need for RDBMS servers to give users a choice of specific native compression algorithms/libraries for specific tables or specific columns. As of this writing, PostgreSQL uses <a href="https://doxygen.postgresql.org/pg__lzcompress_8c_source.html">its own built-in compression algorithm</a> based on LZ for toast table compression. Imagine if there were an interface to select zlib instead of the built-in algorithm. Further, select the zlib compression level. Still further, add an interface for users to create an extension that uses a customized algorithm native to a specific platform that uses hardware acceleration.<br /><br />Well, there is exactly such a proposed feature in the making. Check out this <a href="https://www.postgresql.org/message-id/flat/CAFiTN-uUpX3ck%3DK0mLEk-G_kUQY%3DSNOTeqdaNRR9FMdQrHKebw%40mail.gmail.com#81b25677aea9423d8ebb3feebcd1af46">discussion thread</a> in the PostgreSQL hackers community. It may be a long way to go (as of this writing), but I am very hopeful of this feature going in, because the use-cases are strong enough as shown above, there are no fundamental objections to this functionality, and there are work-in-progress patches submitted.<br /><br />I went ahead and applied this patch, and played around with it. Roughly, below is how the interface looks. After the patch-set fully materializes, the interface might be different, but I think the essence of it would remain more or less the same. 
Below is the output of my tests; please note that it is just to emphasize with examples how cool and useful this feature would be, and to make sense of whatever I explained above in this blog.<br /><br />CREATE TABLE zlibtab(t TEXT COMPRESSION zlib WITH (level '4'));<br />CREATE TABLE lztab(t TEXT); <br />ALTER TABLE lztab ALTER COLUMN t SET COMPRESSION pglz; <br /> <br />pgg:s2:pg$ time psql -c "\copy zlibtab from text.data" <br />COPY 13050 <br /> <br />real 0m1.344s <br />user 0m0.031s <br />sys 0m0.026s <br /><br />pgg:s2:pg$ time psql -c "\copy lztab from text.data" <br />COPY 13050 <br /> <br />real 0m2.088s <br />user 0m0.008s <br />sys 0m0.050s <br /> <br /> <br />pgg:s2:pg$ time psql -c "select pg_table_size('zlibtab'::regclass), pg_table_size('lztab'::regclass)"<br /> pg_table_size | pg_table_size <br />---------------+--------------- <br /> 1261568 | 1687552 <br /><br />pgg:s2:pg$ time psql -c "select NULL from zlibtab where t like '0000'" > /dev/null<br /><br />real 0m0.127s<br />user 0m0.000s<br />sys 0m0.002s<br /><br />pgg:s2:pg$ time psql -c "select NULL from lztab where t like '0000'" > /dev/null<br /><br />real 0m0.050s<br />user 0m0.002s<br />sys 0m0.000s<br /><br />Notice how the two different compression algorithms differ in the compressed size, and in the speed of inserting data (compression) and selecting data (decompression).<br /><br />You would even be able to create a new compression access method in the same way as we create a new index access method :<br />CREATE ACCESS METHOD pglz1 TYPE COMPRESSION HANDLER my_compression_handler;<br />where my_compression_handler should be a PostgreSQL C function that could be created using a PostgreSQL extension. This function assigns its own implementation functions for a set of pre-defined hooks that define everything that the PostgreSQL core needs to know to make use of the compression access method :<br /><br />Datum<br />my_compression_handler(PG_FUNCTION_ARGS)<br />{<br /> CompressionAmRoutine *routine = makeNode(CompressionAmRoutine);<br /><br /> routine->cmcheck = my_cmcheck;<br /> routine->cminitstate = my_cminitstate;<br /> routine->cmcompress = my_cmcompress;<br /> routine->cmdecompress = my_cmdecompress;<br /> routine->cmdecompress_slice = NULL;<br /><br /> PG_RETURN_POINTER(routine);<br />}<br /><br />This is PostgreSQL's way of being highly extensible : allow users to use built-in methods, but also provide a way for them to define their own methods for doing the same job. All the above functions would be inside a PostgreSQL extension, which could be created using:<br />CREATE EXTENSION my_compression;<br /><br /></p>Amit Khandekarhttp://www.blogger.com/profile/11439293447705622764noreply@blogger.com6tag:blogger.com,1999:blog-2570870649449530893.post-82436055543016353622020-07-30T03:01:00.001-07:002020-07-30T03:01:55.071-07:00Backtraces in PostgreSQLPostgreSQL 13 has introduced a simple but extremely useful capability to log a stack trace into the server logs when an error is reported. Let's see the details.<br /><br />There is a GUC to enable stack trace generation : backtrace_functions. Set it to a comma-separated list of function names. <br /><br />SET backtrace_functions TO 'func1,func2';<br /><br />If the error is thrown from one of these functions, a backtrace will be generated and logged into the server log.<br /><br />Note that only superusers can set the backtrace_functions GUC. 
It can be set locally in a session, or it can be included in the postgresql.conf file to set it globally.<br /><br />It's easy to see how it would help in a situation where a customer reports an error message. We can find where it came from by grep'ing for it in the source code. But beyond that, it was all guesswork. Not anymore. Now, you can ask the customer to set backtrace_functions to all such functions that emit this error message, and get the stack trace. In most cases, the root cause of the error is not in the function which emits the error; it's located somewhere in the middle of the stack; hence the stack trace is critical.<br /><br />This capability is already available in many other databases like MySQL, Greenplum, Oracle.<br /><br />What's still missing in PostgreSQL - and is present in most of these other databases - is being able to generate a stack trace when a server backend crashes with a segmentation fault or other such unexpected signals, or when the server PANICs due to some reason. This capability would make a much bigger difference. We will get rid of having to explain the steps to generate a core file. More importantly, this helps in situations where the crash happens only randomly. Even with a single unexpected crash, the customer would always be ready with a backtrace. I am hopeful this would be implemented in the next major release of PostgreSQL.<br /><br />Let's see what a PostgreSQL stack trace log looks like. We will try to use a non-existent type to create a table. Suppose we know that the "type does not exist" error comes from typenameType() in the source code. So we do this :<br /><br />postgres=# set backtrace_functions TO 'typenameType';<br />postgres=# create table tab (id invalidtype);<br />ERROR: type "invalidtype" does not exist<br />LINE 1: create table tab (id invalidtype);<br /><br />Here's a snippet from the server log :<br />2020-07-28 20:17:01.482 CST [22454] ERROR: type "invalidtype" does not exist at character 22<br />2020-07-28 20:17:01.482 CST [22454] BACKTRACE: <br /> postgres: amit postgres [local] CREATE TABLE(typenameType+0xa4) [0xaaaaafcd2ac4]<br /> postgres: amit postgres [local] CREATE TABLE(+0x20f550) [0xaaaaafcd4550] <br /> postgres: amit postgres [local] CREATE TABLE(transformCreateStmt+0x53c) [0xaaaaafcd7a10]<br /> postgres: amit postgres [local] CREATE TABLE(+0x44df20) [0xaaaaaff12f20] <br /> postgres: amit postgres [local] CREATE TABLE(standard_ProcessUtility+0x16c) [0xaaaaaff1225c]<br /> postgres: amit postgres [local] CREATE TABLE(+0x44a4e4) [0xaaaaaff0f4e4] <br /> postgres: amit postgres [local] CREATE TABLE(+0x44af88) [0xaaaaaff0ff88] <br /> postgres: amit postgres [local] CREATE TABLE(PortalRun+0x198) [0xaaaaaff10ed8]<br /> postgres: amit postgres [local] CREATE TABLE(+0x44764c) [0xaaaaaff0c64c] <br /> postgres: amit postgres [local] CREATE TABLE(PostgresMain+0x970) [0xaaaaaff0d3d4]<br /> postgres: amit postgres [local] CREATE TABLE(+0x3b3be4) [0xaaaaafe78be4] <br /> postgres: amit postgres [local] CREATE TABLE(PostmasterMain+0xdc0) [0xaaaaafe79b70]<br /> postgres: amit postgres [local] CREATE TABLE(main+0x480) [0xaaaaafb82510] <br /> /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0xe0) [0xffffaac956e0] <br /> postgres: amit postgres [local] CREATE TABLE(+0xbd5d8) [0xaaaaafb825d8] <br />2020-07-29 18:01:02.726 CST [28776] STATEMENT: create table tab (id invalidtype); <br /><br />Each line of the backtrace has the function name, an offset into that function, and the return address of that frame.<br /><br />For some stack 
frames, the function name is not present; instead, the function address is present. These are static functions, for which the function names are not exposed. But we may be able to get their names from their addresses, with the help of the addr2line command-line tool :<br /><br />$ addr2line 0x20f550 0x44df20 -a -f -e `which postgres`<br />0x000000000020f550<br />transformColumnDefinition<br />:?<br />0x000000000044df20<br />ProcessUtilitySlow.constprop.0<br />:?<br /><br />If it's a debug build, even the file name and offset are printed.<br /><br /><br /><br />Now let's see how this simple feature is implemented.<br /><br />In most of these RDBMSs, including PostgreSQL, Greenplum and MySQL, the feature is implemented using the simple function backtrace() to generate the stack trace:<br /><br />int backtrace(void **buffer, int size);<br /><br />This function only returns all the return addresses of the frames. So it should be followed by a call to backtrace_symbols(), which converts the addresses returned by backtrace() into strings that describe the addresses using the function names, if available :<br /><br />char **backtrace_symbols(void *const *buffer, int size);<br /><br />All the details of these functions are nicely described in their man pages. These functions are available on most platforms.<br /><br />Note a couple of points :<br /><br />1. For the function names to be available for backtrace_symbols(), the executable has to be built using linker options that allow adding all these symbols into a "dynamic symbol table". These options can be given in one of the following ways (these are gcc compiler options) :<br />gcc -rdynamic<br />gcc -Wl,-E<br /><br />2. Sometimes particular stack frames might be missing when compiled with gcc -O2 or a higher optimization level. E.g. check this sample program <a href="https://drive.google.com/file/d/1UYvT3POmZFmtSa17PcuNvyo9iva8XOS7/view?usp=sharing">backtrace.c</a> from the backtrace() man pages.<br /><br />I compile it without -O2 :<br />amit:pg:error$ gcc -rdynamic -o backtrace backtrace.c<br />I get the full stack :<br />amit:pg:error$ ./backtrace 6<br />backtrace() returned 11 addresses<br />./backtrace(myfunc3+0x2c) [0xaaaad6b2edc0]<br />./backtrace(+0xe84) [0xaaaad6b2ee84]<br />./backtrace(myfunc+0x2c) [0xaaaad6b2eebc]<br />./backtrace(myfunc+0x24) [0xaaaad6b2eeb4]<br />./backtrace(myfunc+0x24) [0xaaaad6b2eeb4]<br />./backtrace(myfunc+0x24) [0xaaaad6b2eeb4]<br />./backtrace(myfunc+0x24) [0xaaaad6b2eeb4]<br />./backtrace(myfunc+0x24) [0xaaaad6b2eeb4]<br />./backtrace(main+0x60) [0xaaaad6b2ef28]<br />/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0xe0) [0xffff8c5ba6e0]<br />./backtrace(+0xcc4) [0xaaaad6b2ecc4]<br /><br />Now I compile it with -O2 :<br />amit:pg:error$ gcc -O2 -rdynamic -o backtrace backtrace.c<br />amit:pg:error$ ./backtrace 6<br />backtrace() returned 4 addresses<br />./backtrace(myfunc3+0x38) [0xaaaac7183e40]<br />./backtrace(main+0x4c) [0xaaaac7183cfc]<br />/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0xe0) [0xffffb91286e0]<br />./backtrace(+0xd38) [0xaaaac7183d38]<br /><br />There is no frame for myfunc2() and myfunc(). 
One possibility is that the compiler has replaced the recursive calls of myfunc(), and also the myfunc2() call, with a direct jump to myfunc3(); this is known as tail call optimization.<br /><br />The point being: we need to be aware of such missing frames in a few scenarios.<br /><br /><br />Amit Khandekarhttp://www.blogger.com/profile/11439293447705622764noreply@blogger.com0tag:blogger.com,1999:blog-2570870649449530893.post-77520403121809326502020-06-23T10:23:00.000-07:002020-06-23T21:01:01.165-07:00Leveraging SIMD Vectorization<div dir="ltr" style="text-align: left;" trbidi="on">
With the advent of column store databases, there was an urge to make use of SIMD vector processing. It naturally fits into the way table data is arranged. Let's first briefly check what SIMD is. It stands for Single Instruction Multiple Data. Today, CPU instructions support this kind of mechanism, where the same instruction can be executed simultaneously on multiple data elements. E.g. say you want to double all the column values, or remove the red component of the RGB values of the pixels of an image. For large data, these operations are CPU bottlenecks. So SIMD cuts the CPU time significantly by operating simultaneously on 2, 4, 8, 16 or 32 (or more) data elements, depending on the size of each data element. So suppose we want to do "arr[i] *= 2" for each element of "int32 arr[]". Normally we would iterate through each of the elements to do this operation. In the generated assembly code, the MUL instruction will be run on each of the elements. With SIMD, we would arrange for 4 (or more) adjacent array elements to be loaded into a 128-bit (or larger) CPU "vector" register, then arrange for a "vectorized" version of the MUL instruction to be called using this register, and repeat this for each subsequent 4-element array section.<br />
<br />
How do we arrange for such vectorized assembly instructions to be generated? Well, one way is to write the assembly code ourselves. But in most cases we won't need to, thanks to the two methods below :<br />
<br />
<b>1. Vectorization Intrinsics</b><br />
<br />
For a programmer, an intrinsic is just like any other function call. Underneath, the compiler replaces it with the appropriate assembly instruction. So instead of having to deal with registers through assembly instructions embedded in C/C++ code, we just call the corresponding intrinsic function. Each CPU architecture has its own set of intrinsics and a corresponding header file. As an example, let's vectorize a snippet of PostgreSQL code using the ARM architecture's SIMD intrinsics, to see how big a difference vectorization makes. Before that, you might want to quickly go through the <a href="https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/neon-programmers-guide-for-armv8-a/introducing-neon-for-armv8-a/single-page#fundamentals" target="_blank">NEON architecture</a> to understand the naming conventions for registers, lanes and vectors. NEON is ARM's brand name for its SIMD architecture; the NEON unit is a mandatory part of ARMv8 chips.<br />
<br />
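As a warm-up before the PostgreSQL example, here is a minimal sketch of the "arr[i] *= 2" loop from the introduction, rewritten with NEON intrinsics. This is my own illustration; the function name and the shift-based multiply are arbitrary choices :<br /><br />#include <stdint.h><br />#include <arm_neon.h><br /><br />void double_all(int32_t *arr, int n)<br />{<br />    int i;<br /><br />    /* Process 4 elements at a time in a 128-bit vector. */<br />    for (i = 0; i + 4 <= n; i += 4)<br />    {<br />        int32x4_t v = vld1q_s32(&arr[i]);  /* load 4 adjacent elements */<br />        v = vshlq_n_s32(v, 1);             /* multiply by 2 (left shift) on all 4 lanes */<br />        vst1q_s32(&arr[i], v);             /* store the 4 results back */<br />    }<br /><br />    /* Scalar tail for lengths that are not a multiple of 4. */<br />    for (; i < n; i++)<br />        arr[i] *= 2;<br />}<br /><br />The same load/compute/store pattern, with a remainder loop at the end, appears in the real example below.<br /><br />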
Here is a PostgreSQL code snippet from the <a href="https://doxygen.postgresql.org/backend_2utils_2adt_2numeric_8c.html#a802955bc87af4f3479b776760c12422b" target="_blank">mul_var() function</a>, which is used to multiply two PostgreSQL NUMERIC values. As of this writing, it looks like this :<br />
<br />
for (i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3), i = i1 + i2 + 2;<br />
i2 >= 0; i2--)<br />
dig[i--] += var1digit * var2digits[i2];<br />
<br />
where, the variables are declared as :<br />
int32 *dig;<br />
int16 var1digit, *var2digits;<br />
<br />
Here, you can see that the loop iterates i2+1 times, decrementing both i and i2 on each iteration. That means there is a fixed contiguous section of each of the two arrays over which we repeatedly do the same arithmetic for every element. The arithmetic being done is: multiply two int16 values, and accumulate the product into an int32 value. An assembly instruction is available that does exactly that: the multiply-accumulate-long instruction (SMLAL on ARM64; VMLAL on 32-bit ARM). The corresponding intrinsic is vmlal_s16().<br />
<br />
Let's first simplify the above backward for-loop into an equivalent forward loop :<br />
<br />
i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);<br />
count = i2 + 1;<br />
digptr = &dig[i1 + 2];<br />
for (i = 0; i < count; i++)<br />
digptr[i] += var1digit * var2digits[i];<br />
<br />
So we want to vectorize the above multiply+accumulate statement. We have this intrinsic :<br />
int16x8_t vmlaq_s16(int16x8_t a, int16x8_t b, int16x8_t c);<br />
This does a + (b * c) and returns the result; a, b and c are vectors. The type int16x8_t signifies a vector in a 128-bit NEON register with 8 lanes, each lane holding a 16-bit signed integer. So vmlaq_s16() does the multiply+accumulate operation on all 8 lanes of the 3 vectors in parallel, and returns the 8 result values again in an int16x8_t vector. Each multiply+accumulate is confined to one particular lane of all 3 vectors.<br />
To avoid overflow, as can be seen in the above C snippet, the multiplication is accumulated into a 32-bit integer. So instead of vmlaq_s16(), we have to use an intrinsic that operates on 16-bit values and returns 32-bit values :<br />
int32x4_t vmlal_s16(int32x4_t a, int16x4_t b, int16x4_t c);<br />
Since only four 32-bit data elements fit in a 128-bit vector, four elements can be processed in parallel rather than eight.<br />
<br />
As can be seen, all these operations use 128-bit registers, even though the registers need not be fully occupied, as is the case with int16x4 vectors. We first need to load the C array element values into these registers, and at the end store the resulting values back from the registers into the result array elements; there are intrinsics for that as well. Although intrinsics exist that mix scalars and vectors, the intrinsic used above operates only on vectors, so the same var1digit value is replicated into all 4 lanes of a 16x4 vector.<br />
<br />
With these intrinsics, the final code looks like this :<br />
<br />
#include <arm_neon.h><br />
......<br />
......<br />
int i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);<br />
int remainder;<br />
int count = i2 + 1;<br />
int32 *digptr = &dig[i1 + 2];<br />
<br />
/* Load the same var1digit value into all lanes of 16x4 vector. */<br />
int16x4_t var1digit_16x4 = vdup_n_s16(var1digit); // VDUP.16 d0,r0<br />
<br />
/* Parallelize each group of 4 digits */<br />
remainder = count%4;<br />
count -= remainder;<br />
for (i = 0; i < count; i += 4)<br />
{<br />
/*<br />
* 1. Load required data into vectors<br />
* 2. Do multiply-accumulate-long operation using 16x4 vectors,<br />
* whose output is a 32x4 vector which we need, because digptr[]<br />
* is 32bit.<br />
* 3. Store back the result vector into digptr[]<br />
*/<br />
<br />
/* Load 4 var2digits into 16x4 vector and digptr into 32x4 */<br />
int16x4_t var2digits_16x4 = vld1_s16(&var2digits[i]);<br />
int32x4_t dig_32x4 = vld1q_s32(&digptr[i]);<br />
<br />
/* Vector multiply-accumulate-long: vmlal_<type>. Vr[i] := Va[i] + Vb[i] * Vc[i] */<br />
dig_32x4 = vmlal_s16(dig_32x4, var1digit_16x4, var2digits_16x4);<br />
<br />
/* Store back the result into &digptr[i] */<br />
vst1q_s32(&digptr[i], dig_32x4);<br />
}<br />
<br />
/* Do the last remaining digits */<br />
for (; remainder != 0; remainder--, i++)<br />
digptr[i] += var1digit * var2digits[i];<br />
<br />
<br />
I created a schema that contains numerics with large precisions, <a href="https://drive.google.com/file/d/1H7U5QMksnFuz39djRAbAcunOO8dGWwh2/view?usp=sharing" target="_blank">as shown here</a>, and ran the following query that multiplies t1.val and t2.val. With the non-vectorized code, the execution time was 0.874 milliseconds :<br />
$ psql -c "explain analyze SELECT t1.id, t2.id, t1.val * t2.val FROM num_data t1, num_data t2"<br />
QUERY PLAN <br />
-----------------------------------------------------------------------------------------------------------------------<br />
Nested Loop (cost=0.00..1039.85 rows=67600 width=40) (actual time=0.016..0.840 rows=100 loops=1)<br />
-> Seq Scan on num_data t1 (cost=0.00..12.60 rows=260 width=275) (actual time=0.003..0.004 rows=10 loops=1)<br />
-> Materialize (cost=0.00..13.90 rows=260 width=275) (actual time=0.001..0.002 rows=10 loops=10)<br />
-> Seq Scan on num_data t2 (cost=0.00..12.60 rows=260 width=275) (actual time=0.001..0.002 rows=10 loops=1)<br />
Planning Time: 0.156 ms<br />
Execution Time: <b>0.874</b> ms<br />
(6 rows)<br />
<br />
With the above vectorized code, the same query's execution time is now 0.360 ms, i.e. more than a 2x speedup :<br />
<br />
$ psql -c "explain analyze SELECT t1.id, t2.id, t1.val * t2.val FROM num_data t1, num_data t2"<br />
QUERY PLAN <br />
-----------------------------------------------------------------------------------------------------------------------<br />
Nested Loop (cost=0.00..1039.85 rows=67600 width=40) (actual time=0.016..0.322 rows=100 loops=1)<br />
-> Seq Scan on num_data t1 (cost=0.00..12.60 rows=260 width=275) (actual time=0.007..0.008 rows=10 loops=1)<br />
-> Materialize (cost=0.00..13.90 rows=260 width=275) (actual time=0.001..0.002 rows=10 loops=10)<br />
-> Seq Scan on num_data t2 (cost=0.00..12.60 rows=260 width=275) (actual time=0.001..0.002 rows=10 loops=1)<br />
Planning Time: 0.169 ms<br />
Execution Time: <b>0.360</b> ms<br />
(6 rows)<br />
<br />
<br />
Since the individual digits of one number have to be multiplied by the digits of the other, the benefit grows with the precision of the numerics. The schema I created has values with precisions in the range of 200-600, but on my ARM64 VM the benefit starts showing up from around precision 20 onwards.<br />
<br />
<br />
<b>2. Auto-vectorization</b><br />
<br />
It's not always necessary to write code that uses intrinsics. Often, if we arrange or simplify the code suitably, today's compilers, given the appropriate options, <a href="https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using" target="_blank">try to identify whether the code can be vectorized</a> and generate assembly instructions that leverage the CPU architecture's SIMD. In fact, once I simplified the backward for-loop above into a forward for-loop with a single incrementing variable, gcc was able to auto-vectorize it. Here are the changes again:<br />
<br />
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c<br />
index f3a725271e..4243242ad9 100644<br />
--- a/src/backend/utils/adt/numeric.c<br />
+++ b/src/backend/utils/adt/numeric.c<br />
@@ -7226,6 +7226,7 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,<br />
int res_weight;<br />
int maxdigits;<br />
int *dig;<br />
+ int *digptr;<br />
int carry;<br />
int maxdig;<br />
int newdig;<br />
@@ -7362,10 +7363,14 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,<br />
*<br />
* As above, digits of var2 can be ignored if they don't contribute,<br />
* so we only include digits for which i1+i2+2 <= res_ndigits - 1.<br />
+ *<br />
+ * For large precisions, this can become a bottleneck; so keep this for<br />
+ * loop simple so that it can be auto-vectorized.<br />
*/<br />
- for (i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3), i = i1 + i2 + 2;<br />
- i2 >= 0; i2--)<br />
- dig[i--] += var1digit * var2digits[i2];<br />
+ i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);<br />
+ digptr = &dig[i1 + 2];<br />
+ for (i = 0; i <= i2; i++)<br />
+ digptr[i] += var1digit * var2digits[i];<br />
}<br />
<br />
With this change, in the mul_var() assembly code I could see multiply-accumulate instructions that operate on NEON vectors (these are ARM64 instructions) :<br />
smlal v1.4s, v2.4h, v3.4h<br />
smlal2 v0.4s, v2.8h, v3.8h<br />
<br />
The gcc option that enables auto-vectorization is "-ftree-loop-vectorize"; at gcc -O3 it is enabled by default.<br />
<br />
Although there are examples where gcc can auto-vectorize even backward loops, in this case it could not do so for the original code, seemingly because of the two decrementing variables. That's why I had to simplify it to a forward loop with a single incrementing variable, which is as simple as it gets.<br />
<br />
To check whether gcc has been able to vectorize a particular piece of code, use the gcc -fopt-info-all option. It outputs info such as :<br />
numeric.c:7217:3: optimized: loop vectorized using 16 byte vectors<br />
Or in case it can't vectorize, you would see something like this :<br />
numeric.c:7380:3: missed: couldn't vectorize loop<br />
numeric.c:7381:15: missed: not vectorized: relevant stmt not supported: _39 = *_38;<br />
<br />
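To experiment with auto-vectorization in isolation, here is a minimal standalone sketch with the same multiply-int16, accumulate-into-int32 pattern as the mul_var() loop. The file and function names are mine, not from PostgreSQL :<br /><br />/* vecdemo.c */<br />void mul_acc(int *restrict dst, const short *restrict src, short mult, int n)<br />{<br />    int i;<br /><br />    /* Same shape as the simplified mul_var() loop. */<br />    for (i = 0; i < n; i++)<br />        dst[i] += mult * src[i];<br />}<br /><br />Compiling it with "gcc -O2 -ftree-loop-vectorize -fopt-info-vec-optimized -c vecdemo.c" should report that the loop was vectorized. The restrict qualifiers tell the compiler that the arrays do not overlap, which simplifies its aliasing analysis; without them, the compiler may emit a runtime overlap check or give up.<br /><br />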
With this auto-vectorization method, the speedup I observed was around 2.7x. This is higher than with the intrinsics method, probably because the compiler used a better combination of vectorized instructions than I did: note that the smlal/smlal2 pair shown above processes 8 digits per loop iteration, compared to 4 in my hand-written version.<br />
<br />
<b>Conclusion</b><br />
<br />
Vectorization gives significant returns for repetitive operations. Although it is especially well suited to columnar data, there could be regions in the current PostgreSQL code that would benefit from such tweaks to leverage SIMD. As far as possible, we should arrange for the compiler's auto-vectorization: such a change is cleaner and clearly portable, in contrast with method 1, where we had to use intrinsics specific to the CPU architecture (that example was chosen for the sake of explaining how to use intrinsics). In cases where it is not possible for the compiler to vectorize the code, we should use compiler intrinsics; e.g. <a href="https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/neon-programmers-guide-for-armv8-a/optimizing-c-code-with-neon-intrinsics/single-page#rgb" target="_blank">check this out</a>.</div>
Amit Khandekarhttp://www.blogger.com/profile/11439293447705622764noreply@blogger.com0tag:blogger.com,1999:blog-2570870649449530893.post-7120875061343701382020-05-21T19:21:00.000-07:002020-06-23T10:54:17.380-07:00 PostgreSQL on ARM<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
In my <a href="https://amitdkhan-pg.blogspot.com/2020/04/arm-points-to-be-noted.html" target="_blank">last blog</a>, I wrote that applications that have been running on x86 might need some adaptation if they are to run on a different architecture such as ARM. Let's see what that means exactly.<br />
<br />
Recently I have been playing around with the PostgreSQL RDBMS on an ARM64 machine. A few months back, I didn't even know whether it could be compiled on ARM, oblivious to the fact that there has been a regular ARM64 build farm member for <a href="https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=HEAD" target="_blank">quite a while</a>. And now even the PostgreSQL apt repository has started making PostgreSQL packages <a href="https://www.df7cb.de/blog/2020/arm64-on-apt.postgresql.org.html" target="_blank">available for the ARM64</a> architecture. But the real confidence in the reliability of PostgreSQL-on-ARM came after I tested it with different kinds of scenarios.<br />
<br />
I started with read-only pgbench tests and compared the results on the x86_64 and ARM64 VMs available to me. The aim was not to compare specific CPU implementations; the idea was to find scenarios in which PostgreSQL on ARM does not perform as well, relative to x86, as it does in other scenarios.<br />
<br />
<br />
<b>Test Configuration</b> <br />
<br />
ARM64 VM:<br />
Ubuntu 18.04.3; 8 CPUs; CPU frequency: 2.6 GHz; available RAM : 11GB<br />
x86_64 VM:<br />
Ubuntu 18.04.3; 8 CPUs; CPU frequency: 3.0 GHz; available RAM : 11GB<br />
<br />
The following was common to all tests :<br />
<br />
PostgreSQL parameters changed : shared_buffers = 8GB<br />
pgbench scale factor : 30<br />
pgbench command :<br />
for num in 2 4 6 8 10 12 14<br />
do<br />
pgbench [-S] -c $num -j $num -M prepared -T 40<br />
done<br />
This means pgbench is run with an increasing number of parallel clients, from 2 up to 14.<br />
<br />
<br />
<b>Select-only workload </b><br />
<br />
The pgbench -S option is used for the read-only workload.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifmm1wu0RB_NZRM6ogM_FVoY50mcYPNUD7g_Uzd-gb8WvFuqyD9y3VQc0mkkaiK24EH1tfdHWnuH22ZskpinY22VnXU7RSs0JgnPSFesFxiXQYvtvD0T0NBuQF-lMAErBGPL_kLr0zZkrR/s1600/output.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="750" data-original-width="1000" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifmm1wu0RB_NZRM6ogM_FVoY50mcYPNUD7g_Uzd-gb8WvFuqyD9y3VQc0mkkaiK24EH1tfdHWnuH22ZskpinY22VnXU7RSs0JgnPSFesFxiXQYvtvD0T0NBuQF-lMAErBGPL_kLr0zZkrR/s400/output.jpeg" width="400" /></a></div>
<br />
Between 2 and 4 threads, x86 performance is about 30% higher than ARM's, and the difference keeps growing. Between 4 and 6 threads the curves flatten a bit, and between 6 and 8 they suddenly become steep. After 8, the curves were expected to flatten out or dip, because the machines have 8 CPUs. But there is more to it. The pgbench clients were running on the same machines as the servers, and with fully utilized CPUs the clients took around 20% of the CPU time, so they start to interfere from 6 threads onward. In spite of that, there is a steep rise between 6 and 8, for both ARM and x86. I don't fully understand this yet; possibly it has something to do with the Linux scheduler and the interaction between the pgbench clients and the servers. Note that the curve shape is mostly similar on both x86 and ARM, so this behaviour is not specific to either architecture. One difference between the curves, though, is that the ARM curve dips a bit more from 8 threads onward. Also, between 6 and 8, the sudden jump in transactions is not as steep for ARM as for x86. So the end result in this scenario is: as the CPUs become busier, PostgreSQL on ARM lags behind x86 more and more. Let's see what happens if we remove the interference created by the pgbench clients.<br />
<br />
<br />
<b>select exec_query_in_loop(n)</b><br />
<br />
So, to get rid of the noise caused by the client and server running on the same machine, I arranged to test exactly what I intended to test: query performance. The pgbench clients could run on different machines, but that might introduce a different kind of noise: network latency. So instead, I wrote a PostgreSQL <a href="https://drive.google.com/file/d/1HcnHV5u0unyuH3ve-Jnp8XLqLicOI_PC/view?usp=sharing" target="_blank">C language user-defined function</a> that executes, in a loop, the exact SQL query that this pgbench test runs, and invoked it through a pgbench custom script. Now the pgbench clients are mostly idle. This also excludes commit/rollback time, because most of the time is spent inside the C function.<br />
<br />
pgbench custom script : select exec_query_in_loop(n);<br />
where n is the number of times the pgbench query will be executed on the server in a loop.<br />
The loop query is the query that gets normally executed with pgbench -S option:<br />
SELECT abalance FROM pgbench_accounts WHERE aid = $1<br />
Check details in <a href="https://drive.google.com/file/d/1HcnHV5u0unyuH3ve-Jnp8XLqLicOI_PC/view?usp=sharing" target="_blank">exec_query_in_loop</a>()<br />
<br />
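For illustration, here is a minimal sketch of how such a function might look, written using PostgreSQL's Server Programming Interface (SPI). This is my own simplified version, with a fixed aid value, no error handling and no query parameterization; the actual function linked above may differ :<br /><br />#include "postgres.h"<br />#include "fmgr.h"<br />#include "executor/spi.h"<br /><br />PG_MODULE_MAGIC;<br /><br />PG_FUNCTION_INFO_V1(exec_query_in_loop);<br /><br />/* Run the pgbench -S query n times in a server-side loop. */<br />Datum<br />exec_query_in_loop(PG_FUNCTION_ARGS)<br />{<br />    int32 n = PG_GETARG_INT32(0);<br />    int32 i;<br /><br />    SPI_connect();<br />    for (i = 0; i < n; i++)<br />        SPI_execute("SELECT abalance FROM pgbench_accounts WHERE aid = 1",<br />                    true /* read_only */, 0 /* no row-count limit */);<br />    SPI_finish();<br /><br />    PG_RETURN_INT32(n);<br />}<br /><br />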
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAjNix_vjUKtLstIaQc21PYxruFT0qZIi4OGAxGLzvgK1AuS3hzHur0xekhtNLAZ6lGq1Cm7eTGSpN9OsJYcVrWADWLflJebdZCFx5ySvd1M-pNNOGCEyjZXm9Zi9KRSE6dYxTMdpWbBP_/s1600/output.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="750" data-original-width="1000" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAjNix_vjUKtLstIaQc21PYxruFT0qZIi4OGAxGLzvgK1AuS3hzHur0xekhtNLAZ6lGq1Cm7eTGSpN9OsJYcVrWADWLflJebdZCFx5ySvd1M-pNNOGCEyjZXm9Zi9KRSE6dYxTMdpWbBP_/s400/output.jpeg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<br />
Now you see a very different curve. For both platforms, up to 8 threads the transaction rate is linearly proportional to the number of threads. After 8, as expected, the transaction rate doesn't rise, but it doesn't dip either, even for ARM. PostgreSQL is consistently around 35% slower on ARM compared to x86, which doesn't sound that bad when we consider that the ARM CPU frequency is 2.6 GHz whereas the x86's is 3.0 GHz. Note that the transaction rate is in the single digits, because the function exec_query_in_loop(n) is executed with n=100000.<br />
<br />
This experiment also shows that the previous results with the built-in pgbench script have to do with pgbench client interference, and that the dip in the ARM curve under thread contention is not caused by contention in the server. Note that the transaction rates are calculated on the client side, so even when a query's results are ready, there may be some delay in the client requesting the results, calculating the timestamp, etc., especially in high-contention scenarios.<br />
<br />
<br />
<b>select exec_query_in_loop(n) - PLpgSQL function</b><br />
<br />
Before using the user-defined C function, I had earlier used a <a href="https://drive.google.com/file/d/1-jiYEOIVHdtp8Yfv6QrYhrO06urHILd5/view?usp=sharing" target="_blank">PL/pgSQL function to do the same work</a>. There, I stumbled across a different kind of performance behaviour.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7upkIt8R7UKxk7_oQ5WQJW7I5gyJlZnVQz-pkcr_GfQVV6zsz4WZzyZPwPAsI1u2D50Xefncem-ursHzZVABUVqW_1_EbVFpSICFWJRvrOCWje17oWB7hugniYeEIPE7dtXx2J_OzYczg/s1600/output.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="750" data-original-width="1000" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7upkIt8R7UKxk7_oQ5WQJW7I5gyJlZnVQz-pkcr_GfQVV6zsz4WZzyZPwPAsI1u2D50Xefncem-ursHzZVABUVqW_1_EbVFpSICFWJRvrOCWje17oWB7hugniYeEIPE7dtXx2J_OzYczg/s400/output.jpeg" width="400" /></a></div>
<br />
<br />
<br />
Here, PostgreSQL on ARM is around 65% slower than on x86, regardless of the number of threads. Comparing this with the previous results using the C function, it is clear that PL/pgSQL execution is remarkably slower on ARM for some reason. I checked the perf report, but more or less the same hotspot functions are seen on both ARM and x86. Yet anything executed inside a PL/pgSQL function becomes much slower on ARM than on x86.<br />
<br />
I have yet to check the cache misses to see whether there are more of them on ARM. As of this writing, what I did was this (some PostgreSQL internals here): exec_stmt_foreach_a() calls exec_stmt(). I cloned exec_stmt() into exec_stmt_clone(), and made exec_stmt_foreach_a() call exec_stmt_clone() instead. This sped up the overall execution, but it sped things up 20% more on ARM. Why just this change caused that behaviour is still a mystery to me; maybe it has to do with the location of the function in the binary, but I am not sure.<br />
<br />
<br />
<b>Updates</b><br />
<br />
By default, pgbench runs the tpcb-like built-in script which, in each transaction, updates the pgbench_accounts, pgbench_tellers and pgbench_branches tables, inserts a row into pgbench_history, and selects the account balance.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUUEsz16XLgYGYlsK0W57sq4tBQ_iLZOl-SLgGmHeP0IQe8-QKBdvPu8Zw3t3Z0NNbEA72Ggtik018vGODo6qZMdNHZDNObv8Cy6qXqTBAV8Xdat3fGZJrXhy-GCtt0_iIi-LUSO_m4Fx8/s1600/output.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="750" data-original-width="1000" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUUEsz16XLgYGYlsK0W57sq4tBQ_iLZOl-SLgGmHeP0IQe8-QKBdvPu8Zw3t3Z0NNbEA72Ggtik018vGODo6qZMdNHZDNObv8Cy6qXqTBAV8Xdat3fGZJrXhy-GCtt0_iIi-LUSO_m4Fx8/s400/output.jpeg" width="400" /></a></div>
<br />
<br />
Here, the transaction rate is only around 1-10% less on ARM compared to x86. This is probably because a major portion of the time goes into waiting for locks and into disk writes during commits, and the disks I used are non-SSD disks. But overall it looks like updates in PostgreSQL work well on ARM.<br />
<br />
<br />
Next, I am going to test aggregate queries, partitions, higher CPU counts (32/64/128), larger RAM, and a higher scale factor, to see how PostgreSQL scales on the two platforms with larger resources.<br />
<br />
<br />
<b>Conclusion</b><br />
<br />
We saw that the PostgreSQL RDBMS works quite robustly on ARM64. While it is tricky to compare performance across two different platforms, we could still identify the areas where it is not doing well by comparing patterns of behaviour across different scenarios on the two platforms.</div>
Amit Khandekarhttp://www.blogger.com/profile/11439293447705622764noreply@blogger.com0