Initial results so far are that adding back IPSET_ATTR_GC seems to make the difference. I've only done two rounds of testing, and we've had at least one false positive in the past with our environments, so I'd like to give it a little more time before saying for sure. I should be able to get back to you early next week with something more conclusive. This was my patch:

diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
index 6397d75899bc..e5abb4d24e75 100644
--- a/include/uapi/linux/netfilter/ipset/ip_set.h
+++ b/include/uapi/linux/netfilter/ipset/ip_set.h
@@ -93,6 +93,7 @@ enum {
 	IPSET_ATTR_CADT_MAX = 16,
 	/* Create-only specific attributes */
 	IPSET_ATTR_INITVAL,	/* was unused IPSET_ATTR_GC */
+	IPSET_ATTR_GC,
 	IPSET_ATTR_HASHSIZE,
 	IPSET_ATTR_MAXELEM,
 	IPSET_ATTR_NETMASK,

It would be good to know from the maintainers' perspective whether this even makes sense, though.

The primary user of iptables / ipsets on our hosts is kube-router, the Kubernetes networking framework we use. Its usage of each of these tools involves exec-ing out to the ipset binary (https://github.com/cloudnativelabs/kube-router/blob/master/pkg/utils/ipset.go#L176) and to the iptables binaries (via the iptables wrapper scripts from the coreos Go library: https://github.com/coreos/go-iptables). In regards to ipsets, it mostly calls "ipset save" (https://github.com/cloudnativelabs/kube-router/blob/master/pkg/utils/ipset.go#L535) and "ipset restore -exist" (https://github.com/cloudnativelabs/kube-router/blob/master/pkg/utils/ipset.go#L551). There are a few outliers in the code that create and populate ipsets individually (also using the binaries), but most of those have been converted to save / restore because of the performance penalty of exec-ing out constantly. On our images we run ipset 7.15 and iptables 1.8.8.
-Aaron

From: Jakub Kicinski <kuba@xxxxxxxxxx>
Date: Thursday, June 30, 2022 at 1:04 PM
To: U'ren, Aaron <Aaron.U'ren@xxxxxxxx>
Cc: Thorsten Leemhuis <regressions@xxxxxxxxxxxxx>, McLean, Patrick <Patrick.Mclean@xxxxxxxx>, Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx>, netfilter-devel@xxxxxxxxxxxxxxx, Brown, Russell <Russell.Brown@xxxxxxxx>, Rueger, Manuel <manuel.rueger@xxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, regressions@xxxxxxxxxxxxxxx, Florian Westphal <fw@xxxxxxxxx>, netdev@xxxxxxxxxxxxxxx, Jozsef Kadlecsik <kadlec@xxxxxxxxxxxxx>
Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15

Sounds like you're pretty close to figuring this out! Can you check if the user space is intentionally setting IPSET_ATTR_INITVAL? Either that, or IPSET_ATTR_GC was not as "unused" as initially thought. Testing something like this could be a useful data point:

diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
index 6397d75899bc..7caf9b53d2a7 100644
--- a/include/uapi/linux/netfilter/ipset/ip_set.h
+++ b/include/uapi/linux/netfilter/ipset/ip_set.h
@@ -92,7 +92,7 @@ enum {
 	/* Reserve empty slots */
 	IPSET_ATTR_CADT_MAX = 16,
 	/* Create-only specific attributes */
-	IPSET_ATTR_INITVAL,	/* was unused IPSET_ATTR_GC */
+	IPSET_ATTR_GC,
 	IPSET_ATTR_HASHSIZE,
 	IPSET_ATTR_MAXELEM,
 	IPSET_ATTR_NETMASK,
@@ -104,6 +104,8 @@ enum {
 	IPSET_ATTR_REFERENCES,
 	IPSET_ATTR_MEMSIZE,
 
+	IPSET_ATTR_INITVAL,
+
 	__IPSET_ATTR_CREATE_MAX,
 };
 #define IPSET_ATTR_CREATE_MAX	(__IPSET_ATTR_CREATE_MAX - 1)

On Thu, 30 Jun 2022 14:59:14 +0000 U'ren, Aaron wrote:
> Thorsten / Jozsef -
>
> Thanks for continuing to follow up! I'm sorry that this has moved so slowly; it has taken us a while to find the time to fully track this issue down. However, I think that we have figured out enough to make some more forward progress on this issue.
>
> Jozsef, thanks for your insight into what is happening between those system calls. In regards to your question about wait/wound mutex debugging possibly being enabled, I can tell you that we definitely don't have that enabled on any of our regular machines. While we were debugging, we did turn on quite a few debug options to help us track this issue down, and it is very possible that the strace that started off this email was taken on a machine that did have that debug option enabled. Either way, the root issue occurs on hosts that definitely do not have wait/wound mutex debugging enabled.
>
> The good news is that we finally got one of our development environments into a state where we could reliably reproduce the performance issue across reboots. This was a win because it meant that we were able to do a full bisect of the kernel and could tell relatively quickly whether or not the issue was present in the test kernels.
>
> After bisecting for 3 days, I have been able to narrow it down to a single commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3976ca101990ca11ddf51f38bec7b86c19d0ca6f (netfilter: ipset: Expose the initval hash parameter to userspace)
>
> I'm at a bit of a loss as to why this would cause such severe performance regressions, but I've proved it out multiple times now. I've even checked out a fresh version of the 5.15 kernel that we've been deploying, with just this single commit reverted, and found that the performance problems are completely resolved.
>
> I'm hoping that maybe Jozsef will have some more insight into why this seemingly innocuous commit causes such large performance issues for us?
> If you have any additional patches or other things that you would like us to test, I will try to leave our environment in its current state for the next couple of days so that we can do so.
>
> -Aaron
>
> From: Thorsten Leemhuis <regressions@xxxxxxxxxxxxx>
> Date: Monday, June 20, 2022 at 2:16 AM
> To: U'ren, Aaron <Aaron.U'ren@xxxxxxxx>
> Cc: McLean, Patrick <Patrick.Mclean@xxxxxxxx>, Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx>, netfilter-devel@xxxxxxxxxxxxxxx, Brown, Russell <Russell.Brown@xxxxxxxx>, Rueger, Manuel <manuel.rueger@xxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, regressions@xxxxxxxxxxxxxxx, Florian Westphal <fw@xxxxxxxxx>, netdev@xxxxxxxxxxxxxxx, Jozsef Kadlecsik <kadlec@xxxxxxxxxxxxx>
> Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
>
> On 31.05.22 09:41, Jozsef Kadlecsik wrote:
> > On Mon, 30 May 2022, Thorsten Leemhuis wrote:
> >> On 04.05.22 21:37, U'ren, Aaron wrote:
> [...]
> >
> > Every set lookup behind "iptables" needs two getsockopt() calls: you can
> > see them in the strace logs. The first one checks the internal protocol
> > number of ipset and the second one verifies/gets the processed set (it's
> > an extension to iptables and therefore there's no internal state to save
> > the protocol version number).
>
> Hi Aaron! Did any of the suggestions from Jozsef help to track down the
> root cause? I have this issue on the list of tracked regressions and
> wonder what the status is. Or can I mark this as resolved?
>
> Side note: this is not a "something breaks" regression, and it seems to
> progress slowly, so I'm putting it on the backburner:
>
> #regzbot backburner: performance regression where the culprit is hard to
> track down
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>
> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> reports and sometimes miss something important when writing mails like
> this. If that's the case here, don't hesitate to tell me in a public
> reply; it's in everyone's interest to set the public record straight.
>
> [...]
> >
> > In your strace log
> >
> > 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> > 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.1$
> > 0.109456 close(4) = 0 <0.000022>
> >
> > the only things which happen in the second sockopt function are to lock
> > the NFNL_SUBSYS_IPSET mutex, walk the array of the sets, compare the
> > setnames, save the result in the case of a match, and unlock the mutex.
> > Nothing complicated, no deep, multi-level function calls. Just a few
> > lines of code which haven't changed.
> >
> > The only thing which can slow down the processing is the mutex handling.
> > Do you perhaps have wait/wound mutex debugging accidentally enabled in
> > the kernel? If not, then bisecting the mutex-related patches might help.
> >
> > You wrote that flushing tables or ipsets didn't seem to help. Does that
> > literally mean flushing, i.e. the sets were emptied but not destroyed?
> > Did you try both destroying and flushing?
> >
> >> Jozsef, I still have this issue on my list of tracked regressions and
> >> it looks like nothing has happened since the above mail (or did I miss
> >> it?). Could you maybe provide some guidance to Aaron to get us all
> >> closer to the root of the problem?
> >
> > I really hope it's an accidentally enabled debugging option in the kernel.
> > Otherwise bisecting could help to uncover the issue.
> >
> > Best regards,
> > Jozsef
> >
> >> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> >> reports and sometimes miss something important when writing mails like
> >> this. If that's the case here, don't hesitate to tell me in a public
> >> reply, it's in everyone's interest to set the public record straight.