On Feb 11, 2024, at 02:12, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> On Sat, Feb 10, 2024 at 11:25:27PM +0800, Jerry Shih wrote:
>>>  .macro	aes_cbc_decrypt	keylen
>>> +	srli		LEN, LEN, 2	// Convert LEN from bytes to words
>>>  	vle32.v		v16, (IVP)	// Load IV
>>>  1:
>>> -	vle32.v		v17, (INP)	// Load ciphertext block
>>> -	vmv.v.v		v18, v17	// Save ciphertext block
>>> -	aes_decrypt	v17, \keylen	// Decrypt
>>> -	vxor.vv		v17, v17, v16	// XOR with IV or prev ciphertext block
>>> -	vse32.v		v17, (OUTP)	// Store plaintext block
>>> -	vmv.v.v		v16, v18	// Next "IV" is prev ciphertext block
>>> -	addi		INP, INP, 16
>>> -	addi		OUTP, OUTP, 16
>>> -	addi		LEN, LEN, -16
>>> +	vsetvli		t0, LEN, e32, m4, ta, ma
>>> +	vle32.v		v20, (INP)	// Load ciphertext blocks
>>> +	vslideup.vi	v16, v20, 4	// Setup prev ciphertext blocks
>>> +	addi		t1, t0, -4
>>> +	vslidedown.vx	v24, v20, t1	// Save last ciphertext block
>>
>> Do we need to setup the `e32, len=t0` for next IV?
>> I think we only need 128bit IV (with VL=4).
>>
>>> +	aes_decrypt	v20, \keylen	// Decrypt the blocks
>>> +	vxor.vv		v20, v20, v16	// XOR with prev ciphertext blocks
>>> +	vse32.v		v20, (OUTP)	// Store plaintext blocks
>>> +	vmv.v.v		v16, v24	// Next "IV" is last ciphertext block
>>
>> Same VL issue here.
>
> It's true that the vslidedown.vx and vmv.v.v only need vl=4.  But it also
> works fine with vl unchanged.  It just results in some extra data being moved
> in the registers.  My hypothesis is that this is going to be faster than
> having the three extra instructions per loop iteration to change the vl to 4
> twice.
>
> I still have no real hardware to test on, so I have no quantitative data.
> All I can do is go with my instinct, which is that the shorter version will
> be better.
>
> If you have access to a real CPU that supports the RISC-V vector crypto
> extensions, I'd be interested in the performance you get from each variant.
> (Of course, different RISC-V CPU implementations may have quite different
> performance characteristics, so that still won't be definitive.)

Hi Eric,

Thank you. I think the extra vl doesn't affect performance significantly; the
main cost is still the AES body itself. The original implementation is fine as
it is. For reference, I have put a rough sketch of the vl=4 variant I had in
mind at the end of this mail.

> In general, this level of micro-optimization probably needs to wait until
> there are a variety of CPUs to test on.  We know that parallelizing the
> algorithms is helpful, so we should do that, as this patch does.  But the
> effects of small variations in the instruction sequences are currently
> unclear.
>
> - Eric
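
For reference, here is roughly what I had in mind. It is only an untested
sketch to illustrate the trade-off, reusing the register and label names from
the current patch; the three extra vset* instructions are the ones you
counted:

1:
	vsetvli		t0, LEN, e32, m4, ta, ma
	vle32.v		v20, (INP)	// Load ciphertext blocks
	vslideup.vi	v16, v20, 4	// Setup prev ciphertext blocks
	addi		t1, t0, -4
	vsetivli	zero, 4, e32, m4, ta, ma	// (extra) vl=4: one 128-bit block
	vslidedown.vx	v24, v20, t1	// Save only the last ciphertext block
	vsetvli		zero, t0, e32, m4, ta, ma	// (extra) restore full vl
	aes_decrypt	v20, \keylen	// Decrypt the blocks
	vxor.vv		v20, v20, v16	// XOR with prev ciphertext blocks
	vse32.v		v20, (OUTP)	// Store plaintext blocks
	vsetivli	zero, 4, e32, m4, ta, ma	// (extra) vl=4 again for the IV copy
	vmv.v.v		v16, v24	// Next "IV" is last ciphertext block
	// (pointer/length updates and the loop branch are unchanged from the
	// patch; the vsetvli at the top of the loop restores the full vl)

Functionally both variants should give the same result; the only difference is
whether the slide/copy instructions touch one block or all t0 words. So I
agree the shorter version is the reasonable default until we can measure on
real hardware.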