On 20 December 2017 at 20:52, Arnd Bergmann <arnd@xxxxxxxx> wrote:
> While testing other changes, I discovered that gcc-7.2.1 produces badly
> optimized code for aes_encrypt/aes_decrypt. This is especially true when
> CONFIG_UBSAN_SANITIZE_ALL is enabled, where it leads to extremely
> large stack usage that in turn might cause kernel stack overflows:
>
> crypto/aes_generic.c: In function 'aes_encrypt':
> crypto/aes_generic.c:1371:1: warning: the frame size of 4880 bytes is larger than 2048 bytes [-Wframe-larger-than=]
> crypto/aes_generic.c: In function 'aes_decrypt':
> crypto/aes_generic.c:1441:1: warning: the frame size of 4864 bytes is larger than 2048 bytes [-Wframe-larger-than=]
>
> I verified that this problem exists on all architectures that are
> supported by gcc-7.2, though arm64 in particular is less affected than
> the others. I also found that gcc-7.1 and gcc-8 do not show the extreme
> stack usage but still produce worse code than earlier versions for this
> file, apparently because of optimization passes that generally provide
> a substantial improvement in object code quality but understandably fail
> to find any shortcuts in the AES algorithm.
>
> Turning off the tree-pre and tree-sra optimization steps seems to
> reverse the effect, and could be used as a workaround in case we
> don't get a good gcc fix.
>
> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83356
> Cc: Richard Biener <rguenther@xxxxxxx>
> Cc: Jakub Jelinek <jakub@xxxxxxxxxxx>
> Signed-off-by: Arnd Bergmann <arnd@xxxxxxxx>
> ---
> Jakub and Richard have done a more detailed analysis of this, and are
> working on ways to improve the code again. In the meantime, I'm sending
> this workaround to the Linux crypto maintainers to make them aware of
> this issue and for testing.
>
> What would be a good way to test the performance of the AES code with
> the various combinations of compiler versions, as well as UBSAN and this
> patch enabled or disabled?

You can use the tcrypt.ko module to benchmark AES. Run

  modprobe tcrypt mode=200 sec=1

to run a (lengthy) AES benchmark in various modes. AES-128 in ECB mode
using the largest block size tested is what I usually use for
comparison; a sketch of the full workflow is at the end of this
message.

On my Cortex-A57, the generic AES code runs at ~18 cycles per byte.

Note that we have alternative scalar implementations on ARM and arm64
that are faster, so the performance of aes-generic is not really
relevant there (and so it is essentially dead code).

> ---
>  crypto/aes_generic.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/crypto/aes_generic.c b/crypto/aes_generic.c
> index ca554d57d01e..35f973ba9878 100644
> --- a/crypto/aes_generic.c
> +++ b/crypto/aes_generic.c
> @@ -1331,6 +1331,20 @@ EXPORT_SYMBOL_GPL(crypto_aes_set_key);
>  	f_rl(bo, bi, 3, k);	\
>  } while (0)
>
> +#if __GNUC__ >= 7
> +/*
> + * Newer compilers try to optimize integer arithmetic more aggressively,
> + * which generally improves code quality a lot, but in this specific case
> + * ends up hurting more than it helps, in some configurations drastically
> + * so. This turns off two optimization steps that have been shown to
> + * lead to rather badly optimized code with gcc-7.
> + *
> + * See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83356
> + */
> +#pragma GCC optimize("-fno-tree-pre")
> +#pragma GCC optimize("-fno-tree-sra")
> +#endif
> +
>  static void aes_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
>  {
>  	const struct crypto_aes_ctx *ctx = crypto_tfm_ctx(tfm);
> --
> 2.9.0
>
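
P.S. Here is a minimal sketch of the benchmarking workflow described
above, for anyone reproducing the comparison. It assumes tcrypt is
built as a module (CONFIG_CRYPTO_TEST=m); the grep context count is a
rough guess at how many result lines follow each "ecb(aes)" header in
the log, so adjust it as needed.

  # tcrypt runs its benchmark at module load time and then deliberately
  # fails to load (so it can be re-run later), hence the expected
  # modprobe error; the results go to the kernel log.
  modprobe tcrypt mode=200 sec=1 || true

  # Extract the ECB results; compare the 128-bit-key, largest-block
  # lines across compiler versions, UBSAN on/off, and with/without
  # this patch applied.
  dmesg | grep -A10 'ecb(aes)'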