> > Also, I wonder if we shouldn't simply change the chacha code to use
> > unaligned loads for the state array, as it likely makes very little
> > difference in practice (the state is not accessed from inside the
> > round processing loop)
>
> I am seeing a 0.25% slowdown on 1k blocks in the SSE3 code with the
> change below: [...]
>
> AVX2 and AVX512 uses vbroadcasti128 with memory operands to load the
> state, so they don't require any changes afaik.

I agree. Moving SSE to use unaligned loads is certainly acceptable
these days. Some AVX functions use vpbroadcastd with u32 load
granularity anyway. Some use vbroadcasti128, which theoretically could (?)
suffer somewhat when operating on unaligned data, but I guess that
won't justify all that alignment cruft.

Regards,
Martin