Re: [PATCH 1/2] tls: remove close callback sock unlock/lock and flush_sync

John Fastabend <john.fastabend@xxxxxxxxx> · Fri, 28 Jun 2019 12:40:29 -0700

Jakub Kicinski wrote:
> On Fri, 28 Jun 2019 07:12:07 -0700, John Fastabend wrote:
> > Yeah seems possible although never seen in my testing. So I'll
> > move the test_bit() inside the lock and do a ctx check to ensure
> > still have the reference.
> > 
> >   CPU 0 (free)           CPU 1 (wq)
> > 
> >   lock(sk)
> >                          lock(sk)
> >   set_bit()
> >   cancel_work()
> >   release
> >                          ctx = tls_get_ctx(sk)
> >                          unlikely(!ctx) <- we may have free'd 
> >                          test_bit()
> >                          ...
> >                          release()
> > 
> > or
> > 
> >   CPU 0 (free)           CPU 1 (wq)
> > 
> >                          lock(sk)
> >   lock(sk)
> >                          ctx = tls_get_ctx(sk)
> >                          unlikely(!ctx)
> >                          test_bit()
> >                          ...
> >                          release()
> >   set_bit()
> >   cancel_work()
> >   release
> 
> Hmm... perhaps it's cleanest to stop the work from scheduling before we
> proceed?
> 
> close():
> 	while (!test_and_set(SHED))
> 		flush();
> 
> 	lock(sk);
> 	...
> 
> We just need to move init work, no?

The lock() is already held when entering the unhash() side, so we
need to handle this case as well,

CPU 0 (free)          CPU 1 (wq)

lock(sk)              ctx = tls_get_ctx(sk) <- needs a NULL ptr check
sk_prot->unhash()
  set_bit()
  cancel_work()
  ...
  kfree(ctx)
unlock(sk)

but using cancel and doing an unlikely(!ctx) check should be
sufficient to handle the wq. What I'm not sure how to solve now is
that in patch 2 of this series unhash is still calling strp_done
with the sock lock held. Maybe we need to do a deferred release
like the sockmap side?
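
To make the wq-side check concrete, here is a rough userspace analogy, not
kernel code. Everything named fake_* and CTX_CLOSING is hypothetical; a
pthread mutex stands in for the sock lock, and unlike the real unhash() path
(which is entered with the lock already held) the free side here takes the
lock itself. The sketch also ignores how the work item finds the socket once
the ctx is gone. It only shows the ctx lookup and flag test both happening
under the lock:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the tls ctx and the socket. */
struct fake_ctx {
	unsigned long flags;
	int work_runs;
};

struct fake_sock {
	pthread_mutex_t lock;	/* plays the role of the sock lock */
	struct fake_ctx *ctx;	/* plays the role of tls_get_ctx(sk) */
};

#define CTX_CLOSING 0x1UL	/* hypothetical "tearing down" bit */

/* wq side: look up ctx and test the bit only while holding the lock. */
static bool fake_tx_work(struct fake_sock *sk)
{
	bool ran = false;

	pthread_mutex_lock(&sk->lock);
	struct fake_ctx *ctx = sk->ctx;
	if (ctx && !(ctx->flags & CTX_CLOSING)) {
		ctx->work_runs++;	/* the real work would go here */
		ran = true;
	}
	pthread_mutex_unlock(&sk->lock);
	return ran;
}

/* free side: set the bit, detach ctx from the socket, then free it. */
static void fake_unhash(struct fake_sock *sk)
{
	pthread_mutex_lock(&sk->lock);
	struct fake_ctx *ctx = sk->ctx;
	if (ctx) {
		ctx->flags |= CTX_CLOSING;
		sk->ctx = NULL;		/* later lookups see NULL */
		free(ctx);
	}
	pthread_mutex_unlock(&sk->lock);
}

/* Single-threaded walk through the "free ran first" ordering. */
static void fake_demo(void)
{
	struct fake_sock sk = { .ctx = calloc(1, sizeof(struct fake_ctx)) };

	assert(sk.ctx != NULL);
	pthread_mutex_init(&sk.lock, NULL);
	assert(fake_tx_work(&sk));	/* ctx present: work runs */
	fake_unhash(&sk);
	assert(!fake_tx_work(&sk));	/* ctx gone: work bails out */
	pthread_mutex_destroy(&sk.lock);
}
```

Whichever side wins the lock, the worker either sees a live ctx without the
closing bit or bails out, which is the property the !ctx check is meant to
give. It says nothing about the strp_done problem below.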

Dropping the lock and then grabbing it again doesn't seem right
to me. Based on the comment in tcp_abort we could potentially
"race with userspace socket closes such as tcp_close". IIRC one
of the tls splats from syzbot looked like something along these
lines may have happened.

For now I'm considering adding a strp_cancel() op. Since we are
closing the socket and tearing down, we can probably be OK with
throwing out strp results.

> 
> FWIW I never tested his async crypto stuff, I wonder if there is a way
> to convince normal CPU crypto to pretend to be async?