RE: [PATCH] crypto: x86: Do not acquire fpu context for too long

> -----Original Message-----
> From: Ard Biesheuvel <ardb@xxxxxxxxxx>
> Sent: Friday, October 7, 2022 4:55 PM
> To: Elliott, Robert (Servers) <elliott@xxxxxxx>
> Cc: Taehee Yoo <ap420073@xxxxxxxxx>; Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>;
> linux-crypto@xxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx; tglx@xxxxxxxxxxxxx;
> mingo@xxxxxxxxxx; bp@xxxxxxxxx; dave.hansen@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx;
> hpa@xxxxxxxxx; ebiggers@xxxxxxxxxx
> Subject: Re: [PATCH] crypto: x86: Do not acquire fpu context for too long
> 
> On Tue, 4 Oct 2022 at 17:17, Elliott, Robert (Servers) <elliott@xxxxxxx>
> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Taehee Yoo <ap420073@xxxxxxxxx>
> > > Sent: Tuesday, October 4, 2022 1:02 AM
> > > To: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> > > Cc: linux-crypto@xxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx; tglx@xxxxxxxxxxxxx;
> > > mingo@xxxxxxxxxx; bp@xxxxxxxxx; dave.hansen@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx;
> > > hpa@xxxxxxxxx; ardb@xxxxxxxxxx; ebiggers@xxxxxxxxxx
> > > Subject: Re: [PATCH] crypto: x86: Do not acquire fpu context for too long
> > >
> > > Hi Herbert,
> > > Thanks a lot for your review!
> > >
> > > On 10/4/22 13:52, Herbert Xu wrote:
> > >  > On Tue, Oct 04, 2022 at 04:49:12AM +0000, Taehee Yoo wrote:
> > >  >>
> > >  >>   #define ECB_WALK_START(req, bsize, fpu_blocks) do {             \
> > >  >>           void *ctx = crypto_skcipher_ctx(crypto_skcipher_reqtfm(req)); \
> > >  >> + unsigned int walked_bytes = 0;                                  \
> > >  >>           const int __bsize = (bsize);                            \
> > >  >>           struct skcipher_walk walk;                              \
> > >  >> - int err = skcipher_walk_virt(&walk, (req), false);              \
> > >  >> + int err;                                                        \
> > >  >> +                                                                 \
> > >  >> + err = skcipher_walk_virt(&walk, (req), false);                  \
> > >  >>           while (walk.nbytes > 0) {                               \
> > >  >> -         unsigned int nbytes = walk.nbytes;                      \
> > >  >> -         bool do_fpu = (fpu_blocks) != -1 &&                     \
> > >  >> -                       nbytes >= (fpu_blocks) * __bsize;         \
> > >  >>                   const u8 *src = walk.src.virt.addr;             \
> > >  >> -         u8 *dst = walk.dst.virt.addr;                           \
> > >  >>                   u8 __maybe_unused buf[(bsize)];                 \
> > >  >> -         if (do_fpu) kernel_fpu_begin()
> > >  >> +         u8 *dst = walk.dst.virt.addr;                           \
> > >  >> +         unsigned int nbytes;                                    \
> > >  >> +         bool do_fpu;                                            \
> > >  >> +                                                                 \
> > >  >> +         if (walk.nbytes - walked_bytes > ECB_CBC_WALK_MAX) {    \
> > >  >> +                 nbytes = ECB_CBC_WALK_MAX;                      \
> > >  >> +                 walked_bytes += ECB_CBC_WALK_MAX;               \
> > >  >> +         } else {                                                \
> > >  >> +                 nbytes = walk.nbytes - walked_bytes;            \
> > >  >> +                 walked_bytes = walk.nbytes;                     \
> > >  >> +         }                                                       \
> > >  >> +                                                                 \
> > >  >> +         do_fpu = (fpu_blocks) != -1 &&                          \
> > >  >> +                  nbytes >= (fpu_blocks) * __bsize;              \
> > >  >> +         if (do_fpu)                                             \
> > >  >> +                 kernel_fpu_begin()
> > >  >>
> > >  >>   #define CBC_WALK_START(req, bsize, fpu_blocks)                  \
> > >  >>           ECB_WALK_START(req, bsize, fpu_blocks)
> > >  >> @@ -65,8 +81,12 @@
> > >  >>   } while (0)
> > >  >>
> > >  >>   #define ECB_WALK_END()                                          \
> > >  >> -         if (do_fpu) kernel_fpu_end();                           \
> > >  >> +         if (do_fpu)                                             \
> > >  >> +                 kernel_fpu_end();                               \
> > >  >> +         if (walked_bytes < walk.nbytes)                         \
> > >  >> +                 continue;                                       \
> > >  >>           err = skcipher_walk_done(&walk, nbytes);                \
> > >  >> +         walked_bytes = 0;                                       \
> > >  >>           }                                                       \
> > >  >>           return err;                                             \
> > >  >>   } while (0)
> > >  >
> > >  > skcipher_walk_* is supposed to return at most a page.  Why is this
> > >  > necessary?
> > >  >
> > >  > Cheers,
> > >
> > > I referred to the link below:
> > >
> > > https://lore.kernel.org/all/MW5PR84MB18426EBBA3303770A8BC0BDFAB759@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM/
> > >
> > > Sorry, I hadn't checked that skcipher_walk_* returns at most 4K
> > > at a time, so I thought the fpu context would be held for too long.
> > > But I just checked skcipher_walk_*, and you're right: it returns
> > > chunks under 4K.
> > > So there are no problems, as you mentioned.
> > >
> > > Thank you so much!
> > > Taehee Yoo
> >
> > I think functions using the ECB and CBC macros (and
> > their helper functions) are safe - notice the third
> > argument is called fpu_blocks. So, Aria's ECB mode is
> > probably safe. There are no CTR macros, so that path needs
> > to be checked more carefully.
> >
> > We need to check all the functions that don't use those
> > macros and helpers, such as the SHA functions (I've started
> > working on a patch for those).
> >
> > Running modprobe tcrypt mode=0, I encountered RCU stalls
> > on these:
> >         tcrypt: testing encryption speed of sync skcipher cts(cbc(aes)) using cts(cbc(aes-aesni))
> >         tcrypt: testing encryption speed of sync skcipher cfb(aes) using cfb(aes-aesni)
> >
> > aesni-intel_glue.c registers "__cts(cbc(aes))", not "cts(cbc(aes-aesni))",
> > and doesn't register any cfb algorithms, so those tests are using the
> > generic templates, which must not be breaking up the loops as needed.
> >
> 
> These all use the aes-aesni cipher wrapped in various layers of
> generic code. The core cipher puts kernel_fpu_begin/end around every
> AES block (16 bytes) it processes, so I doubt that the crypto code is
> at fault here, unless the issue is in tcrypt itself.
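
Agreed. For reference, my understanding is that the single-block path
in aesni-intel_glue.c looks roughly like this (paraphrased from memory,
so the details may differ; aes_ctx() and aesni_enc() are the helper
names I recall, not verified against the current tree):

    static void aesni_encrypt(struct crypto_tfm *tfm, u8 *dst, const u8 *src)
    {
        struct crypto_aes_ctx *ctx = aes_ctx(crypto_tfm_ctx(tfm));

        if (!crypto_simd_usable()) {
            /* SIMD unusable in this context; use the generic C code */
            aes_encrypt(ctx, dst, src);
        } else {
            /* FPU context is held for just this one 16-byte block */
            kernel_fpu_begin();
            aesni_enc(ctx, dst, src);
            kernel_fpu_end();
        }
    }

so the FPU is never held across more than one block at this level.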

In 2018, commit 2af632996b89 ("crypto: tcrypt - reschedule during
speed tests") added cond_resched() calls to "Avoid RCU stalls in
the case of non-preemptible kernel and lengthy speed tests by
rescheduling when advancing from one block size to another."

It only makes those calls if the sec module parameter is used
(i.e., run each speed test for a set number of seconds), not in
the default "cycles" mode:

    if (secs) {
        ret = test_mb_acipher_jiffies(data, enc, bs, secs, num_mb);
        cond_resched();
    } else {
        ret = test_mb_acipher_cycles(data, enc, bs, num_mb);
    }

In the original sighting, all three stalls occurred during mode 200.

The reported results range from 151 to 221840 clock cycles per
operation; the stalls do not correlate with the longer cycle counts.

The first 8 tests were fine:

Aug 11 18:55:06 adevxp033-sys unknown: Running modprobe tcrypt mode=200
tcrypt: testing encryption speed of sync skcipher ecb(aes) using ecb(aes-aesni)
tcrypt: testing decryption speed of sync skcipher ecb(aes) using ecb(aes-aesni)
tcrypt: testing encryption speed of sync skcipher cbc(aes) using cbc(aes-aesni)
tcrypt: testing decryption speed of sync skcipher cbc(aes) using cbc(aes-aesni)
tcrypt: testing encryption speed of sync skcipher lrw(aes) using lrw(ecb(aes-aesni))
tcrypt: testing decryption speed of sync skcipher lrw(aes) using lrw(ecb(aes-aesni))
tcrypt: testing encryption speed of sync skcipher xts(aes) using xts(ecb(aes-aesni))
tcrypt: testing decryption speed of sync skcipher xts(aes) using xts(ecb(aes-aesni))

but the 9th triggered the first stall warning:
tcrypt: testing encryption speed of sync skcipher cts(cbc(aes)) using cts(cbc(aes-aesni))
tcrypt: test 0 (128 bit key, 16 byte blocks): 1 operation in 243 cycles (16 bytes)
tcrypt: test 1 (128 bit key, 64 byte blocks): 1 operation in 728 cycles (64 bytes)
tcrypt: test 2 (128 bit key, 128 byte blocks): 1 operation in 939 cycles (128 bytes)
tcrypt: test 3 (128 bit key, 256 byte blocks): 1 operation in 1341 cycles (256 bytes)
tcrypt: test 4 (128 bit key, 1024 byte blocks): 1 operation in 3786 cycles (1024 bytes)
tcrypt: test 5 (128 bit key, 1424 byte blocks): 1 operation in 5064 cycles (1424 bytes)
tcrypt: test 6 (128 bit key, 4096 byte blocks): 1 operation in 13889 cycles (4096 bytes)
tcrypt: test 7 (192 bit key, 16 byte blocks): 1 operation in 236 cycles (16 bytes)
rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: {
1 operation in 745 cycles (64 bytes)

The stack trace printed interspersed with the next 26 test result
lines as the run moved into:
tcrypt: testing decryption speed of sync skcipher cts(cbc(aes)) using cts(cbc(aes-aesni))

It proceeded through the next two with no problem:
tcrypt: testing encryption speed of sync skcipher ctr(aes) using ctr(aes-aesni)
tcrypt: testing decryption speed of sync skcipher ctr(aes) using ctr(aes-aesni)

and then reported two stalls during cfb encryption and cfb decryption:
tcrypt: testing encryption speed of sync skcipher cfb(aes) using cfb(aes-aesni)
tcrypt: test 0 (128 bit key, 16 byte blocks): 1 operation in 182 cycles (16 bytes)
tcrypt: test 1 (128 bit key, 64 byte blocks): 1 operation in 347 cycles (64 bytes)
tcrypt: test 2 (128 bit key, 128 byte blocks): 1 operation in 542 cycles (128 bytes)
tcrypt: test 3 (128 bit key, 256 byte blocks): 1 operation in 963 cycles (256 bytes)
rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: {
1 operation in 3465 cycles (1024 bytes)
tcrypt: test 5 (128 bit key, 1420 byte blocks): 1 operation in 4808 cycles (1420 bytes)
tcrypt: test 6 (128 bit key, 4096 byte blocks): 1 operation in 13631 cycles (4096 bytes)
tcrypt: test 7 (192 bit key, 16 byte blocks): 1 operation in 190 cycles (16 bytes)
tcrypt: test 8 (192 bit key, 64 byte blocks): 1 operation in 360 cycles (64 bytes)
tcrypt: test 9 (192 bit key, 128 byte blocks): 1 operation in 570 cycles (128 bytes)
tcrypt: test 10 (192 bit key, 256 byte blocks): 1 operation in 1010 cycles (256 bytes)
tcrypt: test 11 (192 bit key, 1024 byte blocks): 1 operation in 3660 cycles (1024 bytes)
tcrypt: test 12 (192 bit key, 1420 byte blocks): 1 operation in 5088 cycles (1420 bytes)
tcrypt: test 13 (192 bit key, 4096 byte blocks): 1 operation in 14781 cycles (4096 bytes)
tcrypt: test 14 (256 bit key, 16 byte blocks): 1 operation in 195 cycles (16 bytes)
tcrypt: test 15 (256 bit key, 64 byte blocks): 1 operation in 363 cycles (64 bytes)
tcrypt: test 16 (256 bit key, 128 byte blocks): 1 operation in 592 cycles (128 bytes)
1 operation in 1046 cycles (256 bytes)
tcrypt: test 18 (256 bit key, 1024 byte blocks): 1 operation in 3796 cycles (1024 bytes)
tcrypt: test 19 (256 bit key, 1420 byte blocks): 1 operation in 5283 cycles (1420 bytes)
tcrypt: test 20 (256 bit key, 4096 byte blocks): 1 operation in 14925 cycles (4096 bytes)
tcrypt: testing decryption speed of sync skcipher cfb(aes) using cfb(aes-aesni)
1 operation in 184 cycles (16 bytes)
1 operation in 415 cycles (64 bytes)
tcrypt: test 2 (128 bit key, 128 byte blocks): 1 operation in 661 cycles (128 bytes)
rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: {
1 operation in 1215 cycles (256 bytes)

None of the other modes (from 0 to 999) had problems.

mode=200 ran from 18:55:06 to 18:55:09 (about 3 seconds), one of the
longer durations, but not unique (e.g., mode 300 took 4 seconds and
mode 400 took 5 seconds).

Perhaps the cycles mode needs to call cond_resched() too?
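
If so, a minimal sketch (reusing the snippet above; treat the extra
call as a hypothetical placement that would need to be repeated in
each of tcrypt's speed-test loops):

    if (secs) {
        ret = test_mb_acipher_jiffies(data, enc, bs, secs, num_mb);
        cond_resched();
    } else {
        ret = test_mb_acipher_cycles(data, enc, bs, num_mb);
        /* hypothetical: also yield between block sizes in cycles mode */
        cond_resched();
    }

That would give the scheduler the same chance to run between block
sizes regardless of which test mode is selected.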
