Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

Eric Biggers <ebiggers@xxxxxxxxxx> · Wed, 4 Nov 2020 09:57:42 -0800

On Tue, Nov 03, 2020 at 08:15:06PM +0800, l00374334 wrote:
> From: liqiang <liqiang64@xxxxxxxxxx>
> 
> 	In the libz library, the checksum algorithm adler32 usually occupies
> 	a relatively high hot spot, and the SVE instruction set can easily
> 	accelerate it, so that the performance of libz library will be
> 	significantly improved.
> 
> 	We can divides buf into blocks according to the bit width of SVE,
> 	and then uses vector registers to perform operations in units of blocks
> 	to achieve the purpose of acceleration.
> 
> 	On machines that support ARM64 sve instructions, this algorithm is
> 	about 3~4 times faster than the algorithm implemented in C language
> 	in libz. The wider the SVE instruction, the better the acceleration effect.
> 
> 	Measured on a Taishan 1951 machine that supports 256bit width SVE,
> 	below are the results of my measured random data of 1M and 10M:
> 
> 		[root@xxx adler32]# ./benchmark 1000000
> 		Libz alg: Time used:    608 us, 1644.7 Mb/s.
> 		SVE  alg: Time used:    166 us, 6024.1 Mb/s.
> 
> 		[root@xxx adler32]# ./benchmark 10000000
> 		Libz alg: Time used:   6484 us, 1542.3 Mb/s.
> 		SVE  alg: Time used:   2034 us, 4916.4 Mb/s.
> 
> 	The blocks can be of any size, so the algorithm can automatically adapt
> 	to SVE hardware with different bit widths without modifying the code.
> 
> 
> Signed-off-by: liqiang <liqiang64@xxxxxxxxxx>

Note that this patch does nothing to actually wire up the kernel's copy of libz
(lib/zlib_{deflate,inflate}/) to use this implementation of Adler32.  To do so,
libz would either need to be changed to use the shash API, or you'd need to
implement an adler32() function in lib/crypto/ that automatically uses an
accelerated implementation if available, and make libz call it.

Also, in either case a C implementation would be required too.  There can't be
just an architecture-specific implementation.
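
As a rough illustration of the two points above, here is a minimal userspace
sketch of a generic C Adler32 plus a single entry point that prefers an
accelerated routine when one is available.  The names adler32_generic(),
adler32_arch_select() and adler32_update() are hypothetical, and the selection
hook is stubbed out; in the kernel the choice would more likely depend on CPU
feature checks, which this sketch does not attempt to model.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER32_MOD  65521u	/* largest prime below 2^16 */
#define ADLER32_NMAX 5552	/* max bytes summed before s2 could overflow 32 bits */

typedef uint32_t (*adler32_fn)(uint32_t adler, const uint8_t *buf, size_t len);

/* Portable fallback (hypothetical name); always available. */
static uint32_t adler32_generic(uint32_t adler, const uint8_t *buf, size_t len)
{
	uint32_t s1 = adler & 0xffff;
	uint32_t s2 = adler >> 16;

	while (len) {
		/* Defer the expensive modulo to once per chunk. */
		size_t n = len < ADLER32_NMAX ? len : ADLER32_NMAX;

		len -= n;
		while (n--) {
			s1 += *buf++;
			s2 += s1;
		}
		s1 %= ADLER32_MOD;
		s2 %= ADLER32_MOD;
	}
	return (s2 << 16) | s1;
}

/*
 * Hypothetical per-architecture hook: returns an accelerated routine
 * (e.g. NEON or SVE), or NULL if none is usable.  Stubbed out here.
 */
static adler32_fn adler32_arch_select(void)
{
	return NULL;
}

/* Single entry point (hypothetical name) that a caller such as libz would use. */
uint32_t adler32_update(uint32_t adler, const uint8_t *buf, size_t len)
{
	adler32_fn fn = adler32_arch_select();

	return fn ? fn(adler, buf, len) : adler32_generic(adler, buf, len);
}
```

The initial value passed in is 1, as defined for Adler32.  Any accelerated
routine has to return exactly the same values as the generic one, so the
generic version also serves as the reference for testing.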

Also, as others have pointed out, there's probably not much point in having an
SVE implementation of Adler32 when there isn't even a NEON implementation yet.
It's not too hard to implement Adler32 using NEON, and there are already several
permissively-licensed NEON implementations out there that could be used as a
reference, e.g. my implementation using NEON intrinsics here:
https://github.com/ebiggers/libdeflate/blob/v1.6/lib/arm/adler32_impl.h

- Eric