Jakub Kicinski wrote: > On Mon, 08 Jul 2019 19:13:29 +0000, John Fastabend wrote: > > Resolve a series of splats discovered by syzbot and an unhash > > TLS issue noted by Eric Dumazet. > > > > The main issues revolved around interaction between TLS and > > sockmap tear down. TLS and sockmap could both reset sk->prot > > ops creating a condition where a close or unhash op could be > > called forever. A rare race condition resulting from a missing > > rcu sync operation was causing a use after free. Then on the > > TLS side dropping the sock lock and re-acquiring it during the > > close op could hang. Finally, sockmap must be deployed before > > tls for current stack assumptions to be met. This is enforced > > now. A feature series can enable it. > > > > To fix this first refactor TLS code so the lock is held for the > > entire teardown operation. Then add an unhash callback to ensure > > TLS can not transition from ESTABLISHED to LISTEN state. This > > transition is a similar bug to the one found and fixed previously > > in sockmap. Then apply three fixes to sockmap to fix up races > > on tear down around map free and close. Finally, if sockmap > > is destroyed before TLS we add a new ULP op update to inform > > the TLS stack it should not call sockmap ops. This last one > > appears to be the most commonly found issue from syzbot. > > Looks like strparser is not done'd for offload? Right so if rx_conf != TLS_SW then the hardware needs to do the strparser functionality. > > About patch 6 - I was recently wondering about the "impossible" syzbot > report where context is not freed and my conclusion was that there > can be someone sitting at lock_sock() in tcp_close() already by the > time we start installing the ULP, so TLS's close will never get called. > The entire replacing of callbacks business is really shaky :( Well replacing callbacks is the ULP model. The race we are fixing in patch 6 is sockmap being free'd which removes psock and resets proto ops with tcp_close() path. I don't think there is another race like you describe because tcp_set_ulp is called from do_tcp_setsockopt which holds the lock and tcp state is checked to ensure its ESTABLISHED. A closing sock wont be in ESTABLISHED state so any setup will be aborted. Before patch 1 though I definately saw this race because we dropped the lock mid-close. With this series I've been running those syzbot programs over night without issue on 4 cores. Also selftests pass in ./net/tls and ./bpf/ so I think its stable and resolves many of the issues syzbot has been stomping around. > > Perhaps I'm rumbling, I will take a close look after I get some sleep :) Yes please do ;)