Tuesday, 23 June 2020

Leveraging SIMD Vectorization

With the advent of column store databases, there was an urge to make use of SIMD vector processing. It naturally fits into the way table data is arranged. Let's first briefly check what is SIMD. It stands for Single Instruction Multiple Data. Today, CPU instructions support this kind of mechanism where the same instruction can be executed simultaneously on multiple data elements. E.g. Say, you want to double all the column values. Or remove the red component of the RGB values of pixels of an image. For large data, these operations are CPU bottlenecks. So SIMD cuts the CPU time significantly by operating simultaneously on 2, 4, 8, 16 or 32 (or more) data elements depending on the size of each data element. So suppose we want to do "arr[i] *= 2" for each element of "int32 arr[]". Normally we would iterate through each of the elements for doing this operation. In the generated assembly code, MUL instruction will be run on each of the elements. With SIMD, we would arrange for loading 4 (or more) adjacent array elements into a 128-bit (or larger) CPU "vector" register, and then arrange for a "vectorized" version of the MUL instruction to be called using this register, and repeat this for each subsequent 4 element array section.

How do we arrange for generating such vectorized assembly instructions ? Well, one way is to write such an assembly code. But in most of the cases, we won't need this method, thanks to the below two methods :

1. Vectorization Intrinsics

For a programmer, an intrinsic is just like any other function call. Underneath, the compiler replaces it with an appropriate assembly instruction. So instead of having to deal with registers using assembly instruction inside C/C++ code, call the corresponding intrinsic function. Each CPU architecture has it's own set of intrinsics API, and corresponding header file. As an example, let's vectorize a snippet of PostgreSQL code using ARM architecture's SIMD intrinsics, to see how big a difference it makes by vectorizing things. Before that, you might want to quickly go through the NEON architecture to understand the naming conventions for registers, lanes and vectors. NEON is ARM's brand name for SIMD architecture. NEON unit is a mandatory part of ARMv8 chip.

Here is a PostgreSQL code snippet from the mul_var() function that is used to multiply two PostgreSQL NUMERIC data types. As of this writing, it looks like this :

for (i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3), i = i1 + i2 + 2;
     i2 >= 0; i2--)
   dig[i--] += var1digit * var2digits[i2];

where, the variables are declared as :
int32 *dig;
int16  var1digit, *var2digits;

Here, you can see that the loop iterates i2+1 times. On each iteration, both i and i2 are decremented. That means, there is a fixed contiguous section of each of the two arrays where we want to repeatedly do the same arithmetic operation for every array element in this section. The arithmetic being done here is : multiply two int16 variables, and add up that product into an int32 variable. An assembly instruction is available which exactly does that : VMLA. The corresponding intrinsic is : vmlal_s16()

Let's first simplify the above backward for-loop into an equivalent forward loop :

i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
count = i2 + 1;
digptr = &dig[i1 + 2];
for (i = 0; i < count; i++)
   digptr[i] += var1digit * var2digits[i];

So we want to vectorize the above multiply+accumulate statement. We have this intrinsic :
int16x8_t   vmlaq_s16(int16x8_t a, int16x8_t b, int16x8_t c);
This does a+(b*c) and returns the result. a, b and c are vectors. The type int16x8_t signifies that the vector is in a 128-bit NEON register having 8 lanes, each lane having 16-bit signed integers. So vmlaq_s16() does the multiply+accumulate operation on all 8 lanes of the 3 vectors in parallel, and returns the 8 result values again in a int16x8_t vector. Each multiple+accumulate is contained in one particular lane of all the 3 vectors.
To avoid overflow, as can be seen in the above C snippet, the multiplication is accumulated into a 32-bit integer. So instead of vmlaq_s16(), we have to use an intrinsic that operates on 16-bit values and returns 32bit values :
int32x4_t  vmlal_s16(int32x4_t a, int16x4_t b, int16x4_t c);
Since only 4 32-bit data elements can be accommodated in a 128-bit vector, 4 elements could be parallelized rather than 8.

As can be seen, all these operations use the 128-bit registers, even though they need not be fully occupied, as in the case with int16x4 vectors. We need to first load the C array element values into these registers, and in the end, store the resultant values back from the registers into the result array elements. We have intrinsics for that also. Although there are intrinsics that operate on a mix of scalar and vectors, the intrinsic used above uses only vectors. So the same var1digit value can be loaded into all 4 lanes of a 16x4 vector.

With these instrinsics, the final code looks like this :

#include <arm_neon.h>
......
......
int i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
int remainder;
int count = i2 + 1;
int32 *digptr = &dig[i1 + 2];

/* Load the same var1digit value into all lanes of 16x4 vector. */
int16x4_t   var1digit_16x4 = vdup_n_s16(var1digit);     // VDUP.16 d0,r0

/* Parallelize each group of 4 digits */
remainder = count%4;
count -= remainder;
for (i = 0; i < count; i += 4)
{
    /*
     * 1. Load required data into vectors
     * 2. Do multiply-accumulate-long operation using 16x4 vectors,
     *    whose output is a 32x4 vector which we need, because digptr[]
     *    is 32bit.
     * 3. Store back the result vector into digptr[]
     */

    /* Load 4 var2digits into 16x4 vector and digptr into 32x4 */
    int16x4_t    var2digits_16x4 = vld1_s16(&var2digits[i]);
    int32x4_t    dig_32x4 = vld1q_s32(&digptr[i]);

    /* Vector multiply-accumulate-long: vmlal_<type>. Vr[i] := Va[i] + Vb[i] * Vc[i] */
    dig_32x4 = vmlal_s16(dig_32x4, var1digit_16x4, var2digits_16x4);

    /* Store back the result into &digptr[i] */
    vst1q_s32(&digptr[i], dig_32x4);
}

/* Do the last remaining digits */
for (; remainder != 0; remainder--, i++)
    digptr[i] += var1digit * var2digits[i];


I created a schema that contains numerics with large precisions, as shown here, and ran the following query that multiplies t1.val and t2.val. With the non-vectorized code, the execution time showed .874 milliseconds :
$ psql -c "explain analyze SELECT t1.id, t2.id, t1.val * t2.val FROM num_data t1, num_data t2"
                                                      QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..1039.85 rows=67600 width=40) (actual time=0.016..0.840 rows=100 loops=1)
   ->  Seq Scan on num_data t1  (cost=0.00..12.60 rows=260 width=275) (actual time=0.003..0.004 rows=10 loops=1)
   ->  Materialize  (cost=0.00..13.90 rows=260 width=275) (actual time=0.001..0.002 rows=10 loops=10)
         ->  Seq Scan on num_data t2  (cost=0.00..12.60 rows=260 width=275) (actual time=0.001..0.002 rows=10 loops=1)
 Planning Time: 0.156 ms
 Execution Time: 0.874 ms
(6 rows)

With the above vectorized code, the same query execution time is now .360 ms, i.e. more than 2x speedup :

$ psql -c "explain analyze SELECT t1.id, t2.id, t1.val * t2.val FROM num_data t1, num_data t2"
                                                      QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..1039.85 rows=67600 width=40) (actual time=0.016..0.322 rows=100 loops=1)
   ->  Seq Scan on num_data t1  (cost=0.00..12.60 rows=260 width=275) (actual time=0.007..0.008 rows=10 loops=1)
   ->  Materialize  (cost=0.00..13.90 rows=260 width=275) (actual time=0.001..0.002 rows=10 loops=10)
         ->  Seq Scan on num_data t2  (cost=0.00..12.60 rows=260 width=275) (actual time=0.001..0.002 rows=10 loops=1)
 Planning Time: 0.169 ms
 Execution Time: 0.360 ms
(6 rows)


Since individual digits of the number have to be multiplied by the digits of the other number, the benefit is more for numerics with large precision. The schema I created has values with precisions in the range of 200-600. But the benefit starts showing up from around 20 precision onwards, with my ARM64 VM.


2. Auto-vectorization

It's not always necessary to write code that uses intrinsics. Often if we arrange/simplify the code, today's compilers, with appropriate compiler options, try to identify if the code can be vectorized, and generate appropriate assembly instructions that leverage the CPU architecture's SIMD. In fact, above where I simplified the backward for-loop to a forward for-loop that uses a single variable increment, the gcc compiler is able to auto-vectorize the simplified for-loop. Here are the changes again:

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index f3a725271e..4243242ad9 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -7226,6 +7226,7 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
      int                res_weight;
      int                maxdigits;
      int            *dig;
+      int            *digptr;
      int                carry;
      int                maxdig;
      int                newdig;
@@ -7362,10 +7363,14 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
             *
             * As above, digits of var2 can be ignored if they don't contribute,
             * so we only include digits for which i1+i2+2 <= res_ndigits - 1.
+            *
+            * For large precisions, this can become a bottleneck; so keep this for
+            * loop simple so that it can be auto-vectorized.
             */
-            for (i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3), i = i1 + i2 + 2;
-                  i2 >= 0; i2--)
-                  dig[i--] += var1digit * var2digits[i2];
+            i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
+            digptr = &dig[i1 + 2];
+            for (i = 0; i <= i2; i++)
+                  digptr[i] += var1digit * var2digits[i];
      }

With this change, in mul_var() assembly code, I could see the multiply-accumulate instructions that operate on NEON vectors (these are arm64 instructions) :
    smlal   v1.4s, v2.4h, v3.4h
    smlal2  v0.4s, v2.8h, v3.8h

gcc compiler option to enable auto-vectorization is "-ftree-loop-vectorize". With gcc -O3, it is always enabled.

Although there are examples where gcc is able to auto-vectorize even backward loops, in the above case, it could not do so for the original code, seemingly because of two decrementing variables. That's why I had to simplify it to a forward loop with a single variable increment, which is as simple as it gets.

To check whether gcc has been able to vectorize a particular code, use the  gcc -fopt-info-all option. This outputs info such as this :
numeric.c:7217:3: optimized: loop vectorized using 16 byte vectors
Or in case it can't vectorize, you would see something like this :
numeric.c:7380:3: missed: couldn't vectorize loop
numeric.c:7381:15: missed: not vectorized: relevant stmt not supported: _39 = *_38;

With this auto-vectorization method, the speedup I observed was around 2.7x. This speedup is higher than the intrinsics method, probably because the compiler might have used a better combination of assembly vectorized instructions than I did.

Conclusion

Vectorizing operations gives significant returns in repetitive operations. Although it suits well for columnar data, there could be some regions in current PostgreSQL code that might benefit from such tweaks to leverage SIMD.  As far as possible, we should arrange for the compiler's auto-vectorization. Such change is cleaner and clearly portable. Compare this with method 1 where we had to use intrinsics specific to the CPU architecture. But that example was chosen for the sake of explaining how to make use of intrinsics. In cases where it is not possible for the compiler to vectorize the code, we should use compiler intrinsics. E.g. check this out.

No comments:

Post a comment

Backtraces in PostgreSQL

PostgreSQL 13 has introduced a simple but extremely useful capability to log a stack trace into the server logs when an error is reported. L...