On Wed, Nov 6, 2024 at 5:54 PM Ondrej Mosnacek <omosnace@xxxxxxxxxx> wrote: > > On Wed, Nov 6, 2024 at 5:13 PM Eric Dumazet <edumazet@xxxxxxxxxx> wrote: > > > > On Wed, Nov 6, 2024 at 4:55 PM Ondrej Mosnacek <omosnace@xxxxxxxxxx> wrote: > > > > > > SELinux tracks the number of allocated xfrm_state/xfrm_policy objects > > > (via the selinux_xfrm_refcount variable) as an input in deciding if peer > > > labeling should be used. > > > > > > However, as a result of commits f35f821935d8 ("tcp: defer skb freeing > > > after socket lock is released") and 68822bdf76f1 ("net: generalize skb > > > freeing deferral to per-cpu lists"), freeing of a sk_buff object, which > > > may hold a reference to an xfrm_state object, can be deferred for > > > processing on another CPU core, so even after xfrm_state is deleted from > > > the configuration by userspace, the refcount isn't decremented until the > > > deferred freeing of relevant sk_buffs happens. On a system with many > > > cores this can take a very long time (even minutes or more if the system > > > is not very active), leading to peer labeling being enabled for much > > > longer than expected. > > > > > > Fix this by moving the selinux_xfrm_refcount decrementing to just after > > > the actual deletion of the xfrm objects rather than waiting for the > > > freeing to happen. For xfrm_policy it currently doesn't seem to be > > > necessary, but let's do the same there for consistency and > > > future-proofing. > > > > > > We hit this issue on a specific aarch64 256-core system, where the > > > sequence of unix_socket/test and inet_socket/tcp/test from > > > selinux-testsuite [1] would quite reliably trigger this scenario, and a > > > subsequent sctp/test run would then stumble because the policy for that > > > test misses some rules that would make it work under peer labeling > > > enabled (namely it was getting the netif::egress permission denied in > > > some of the test cases). > > > > > > [1] https://github.com/SELinuxProject/selinux-testsuite/ > > > > > > Fixes: f35f821935d8 ("tcp: defer skb freeing after socket lock is released") > > > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") > > > Signed-off-by: Ondrej Mosnacek <omosnace@xxxxxxxxxx> > > > --- > > > > Can we explain why TCP packets sitting in TCP receive queues would > > need to keep xfrm_state around ? > > > > With thousands of TCP sockets. I would imagine that a similar issue > > would be hit, > > regardless of f35f821935d8 ("tcp: defer skb freeing after socket lock > > is released") and 68822bdf76f1 ("net: generalize skb freeing deferral > > to per-cpu lists") > > > > We remove the dst from these incoming packets (see skb_dst_drop() in > > tcp_data_queue() and tcp_add_backlog()), > > I do not see how XFRM state could be kept ? > > The problem is not with the xfrm_state reference via dst_entry, but > the one in skb_ext (skb->extensions) -> sec_path > (skb_ext_get_ptr(skb->extensions, SKB_EXT_SEC_PATH)) -> the xvec > array. But you have a point that I should say that in the commit > message... > So some secpath_reset() calls should be added in various protocol handlers before packets are possibly queued for hours in a socket queue ? I see one in l2tp_eth_dev_recv(). > And I think you're right that even without those commits a delay in > processing the RCU free callbacks could cause a similar issue, it just > wouldn't be as easy to trigger. That made me look deeper into history > which commit actually added the decrement on free and it turns out it > was done intentionally as a bugfix - see commit e4e8536f65b5 > ("selinux: fix the labeled xfrm/IPsec reference count handling"). > Before that commit the logic was similar to what my patch is doing, so > I could be re-introducing another bug here :-/ The commit message is > not very helpful there - Paul, do you happen to remember what the > issue was that prompted it? I guess there can be some alloc's that > won't have a matching delete in the right circumstances? Or something > involving the selinux_xfrm_policy_clone() case? > > -- > Ondrej Mosnacek > Senior Software Engineer, Linux Security - SELinux kernel > Red Hat, Inc. >