Re: [PATCH riscv/for-next] crypto: riscv - parallelize AES-CBC decryption

Jerry Shih <jerry.shih@xxxxxxxxxx> · Mon, 26 Feb 2024 09:40:14 +0800

On Feb 11, 2024, at 02:12, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> On Sat, Feb 10, 2024 at 11:25:27PM +0800, Jerry Shih wrote:
>>> .macro	aes_cbc_decrypt	keylen
>>> +	srli		LEN, LEN, 2	// Convert LEN from bytes to words
>>> 	vle32.v		v16, (IVP)	// Load IV
>>> 1:
>>> -	vle32.v		v17, (INP)	// Load ciphertext block
>>> -	vmv.v.v		v18, v17	// Save ciphertext block
>>> -	aes_decrypt	v17, \keylen	// Decrypt
>>> -	vxor.vv		v17, v17, v16	// XOR with IV or prev ciphertext block
>>> -	vse32.v		v17, (OUTP)	// Store plaintext block
>>> -	vmv.v.v		v16, v18	// Next "IV" is prev ciphertext block
>>> -	addi		INP, INP, 16
>>> -	addi		OUTP, OUTP, 16
>>> -	addi		LEN, LEN, -16
>>> +	vsetvli		t0, LEN, e32, m4, ta, ma
>>> +	vle32.v		v20, (INP)	// Load ciphertext blocks
>>> +	vslideup.vi	v16, v20, 4	// Setup prev ciphertext blocks
>>> +	addi		t1, t0, -4
>>> +	vslidedown.vx	v24, v20, t1	// Save last ciphertext block
>> 
>> Do we need to set up `e32, len=t0` for the next IV?
>> I think we only need a 128-bit IV (with VL=4).
>> 
>>> +	aes_decrypt	v20, \keylen	// Decrypt the blocks
>>> +	vxor.vv		v20, v20, v16	// XOR with prev ciphertext blocks
>>> +	vse32.v		v20, (OUTP)	// Store plaintext blocks
>>> +	vmv.v.v		v16, v24	// Next "IV" is last ciphertext block
>> 
>> Same VL issue here.
> 
> It's true that the vslidedown.vx and vmv.v.v only need vl=4.  But it also works
> fine with vl unchanged.  It just results in some extra data being moved in the
> registers.  My hypothesis is that this is going to be faster than having the
> three extra instructions per loop iteration to change the vl to 4 twice.
> 
> I still have no real hardware to test on, so I have no quantitative data.  All I
> can do is go with my instinct which is that the shorter version will be better.
> 
> If you have access to a real CPU that supports the RISC-V vector crypto
> extensions, I'd be interested in the performance you get from each variant.
> (Of course, different RISC-V CPU implementations may have quite different
> performance characteristics, so that still won't be definitive.)

Hi Eric,
Thank you. I don't think the larger vl affects performance significantly; most
of the time is still spent in the AES body.
The original implementation is good enough for now.
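
For reference, here is a rough Python model of the data flow discussed above.
It is only a sketch: the block cipher is a toy stand-in (not AES), and the
names toy_encrypt_block, toy_decrypt_block, cbc_decrypt_chunked and
blocks_per_iter are made up for illustration. It shows why the blocks in one
iteration can be decrypted independently and where the next IV comes from.

```python
BLOCK = 16  # AES block size in bytes


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt_block(block, key):
    # Toy stand-in for AES (NOT secure): XOR with key, rotate left one byte.
    x = xor(block, key)
    return x[1:] + x[:1]


def toy_decrypt_block(block, key):
    # Inverse of the toy cipher: rotate right one byte, XOR with key.
    x = block[-1:] + block[:-1]
    return xor(x, key)


def cbc_encrypt(pt, key, iv):
    out, prev = [], iv
    for i in range(0, len(pt), BLOCK):
        prev = toy_encrypt_block(xor(pt[i:i + BLOCK], prev), key)
        out.append(prev)
    return b"".join(out)


def cbc_decrypt_serial(ct, key, iv):
    # One block per iteration, like the code removed by the patch.
    out, prev = [], iv
    for i in range(0, len(ct), BLOCK):
        c = ct[i:i + BLOCK]
        out.append(xor(toy_decrypt_block(c, key), prev))
        prev = c  # next "IV" is the previous ciphertext block
    return b"".join(out)


def cbc_decrypt_chunked(ct, key, iv, blocks_per_iter=4):
    # Several blocks per iteration, like the vectorized loop.
    blocks = [ct[i:i + BLOCK] for i in range(0, len(ct), BLOCK)]
    out = []
    for start in range(0, len(blocks), blocks_per_iter):
        chunk = blocks[start:start + blocks_per_iter]
        prev = [iv] + chunk[:-1]  # models vslideup.vi by 4 words over the IV
        last = chunk[-1]          # models vslidedown.vx by (vl - 4) words
        dec = [toy_decrypt_block(c, key) for c in chunk]  # independent
        out.extend(xor(d, p) for d, p in zip(dec, prev))
        iv = last                 # next "IV" is the last ciphertext block
    return b"".join(out)
```

Both decrypt functions should return the same plaintext for any
blocks_per_iter, since each plaintext block depends only on its own ciphertext
block and the one before it.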

> In general, this level of micro-optimization probably needs to wait until
> there are a variety of CPUs to test on.  We know that parallelizing the
> algorithms is helpful, so we should do that, as this patch does.  But the
> effects of small variations in the instruction sequences are currently unclear.
> 
> - Eric