Re: [bpf PATCH v2 0/6] bpf: sockmap/tls fixes

John Fastabend <john.fastabend@xxxxxxxxx> · Tue, 09 Jul 2019 08:40:14 -0700

Jakub Kicinski wrote:
> On Mon, 08 Jul 2019 19:13:29 +0000, John Fastabend wrote:
> > Resolve a series of splats discovered by syzbot and an unhash
> > TLS issue noted by Eric Dumazet.
> > 
> > The main issues revolved around interaction between TLS and
> > sockmap tear down. TLS and sockmap could both reset sk->prot
> > ops creating a condition where a close or unhash op could be
> > called forever. A rare race condition resulting from a missing
> > rcu sync operation was causing a use after free. Then on the
> > TLS side dropping the sock lock and re-acquiring it during the
> > close op could hang. Finally, sockmap must be deployed before
> > tls for current stack assumptions to be met. This is enforced
> > now. A feature series can enable it.
> > 
> > To fix this first refactor TLS code so the lock is held for the
> > entire teardown operation. Then add an unhash callback to ensure
> > TLS can not transition from ESTABLISHED to LISTEN state. This
> > transition is a similar bug to the one found and fixed previously
> > in sockmap. Then apply three fixes to sockmap to fix up races
> > on tear down around map free and close. Finally, if sockmap
> > is destroyed before TLS we add a new ULP op update to inform
> > the TLS stack it should not call sockmap ops. This last one
> > appears to be the most commonly found issue from syzbot.
> 
> Looks like strparser is not done'd for offload?

Right so if rx_conf != TLS_SW then the hardware needs to do
the strparser functionality.

> 
> About patch 6 - I was recently wondering about the "impossible" syzbot
> report where context is not freed and my conclusion was that there
> can be someone sitting at lock_sock() in tcp_close() already by the
> time we start installing the ULP, so TLS's close will never get called.
> The entire replacing of callbacks business is really shaky :(

Well replacing callbacks is the ULP model. The race we are fixing in
patch 6 is sockmap being free'd which removes psock and resets proto ops
with tcp_close() path.

I don't think there is another race like you describe because tcp_set_ulp
is called from do_tcp_setsockopt which holds the lock and tcp state is
checked to ensure its ESTABLISHED. A closing sock wont be in ESTABLISHED
state so any setup will be aborted. Before patch 1 though I definately
saw this race because we dropped the lock mid-close.

With this series I've been running those syzbot programs over night
without issue on 4 cores. Also selftests pass in ./net/tls and ./bpf/
so I think its stable and resolves many of the issues syzbot has been
stomping around.

> 
> Perhaps I'm rumbling, I will take a close look after I get some sleep :)

Yes please do ;)