Jozsef / Jakub / Thorsten-

Thanks for all of your help with this issue. I think that we can close it out now. After continuing to dig into the problem, I eventually figured out that it was caused by how our userspace tooling was interacting with ipset save / restore and the new(ish) initval option that is included in saves / restores.

Specifically, kube-router runs an ipset save, processes the saved ipset data, massages it a bit based upon state from the Kubernetes cluster, and then runs that data back through ipset restore. During this process, we create unique temporary sets based upon unique sets of options, rotate the new endpoints into the temporary sets, and then use swap instructions in order to minimize impact to the data path. However, because we were only massaging options that we recognized and cared about, initval was left alone and blindly copied into our option strings for new and temporary sets. This caused initval to be used incorrectly (i.e. the same initval ID was used for multiple sets). I'm not 100% sure about all of the consequences of this, but it seems to have objectively caused some performance issues. Additionally, since initval is intentionally unique between sets, it caused us to create many more temporary sets for swapping than were actually necessary, which caused obvious performance issues as restores now contained more instructions than they needed to.

Reverting the commit removed the issue we saw because it removed the portion of the kernel that generates the initvals, which reverted ipset save to its previous (5.10 and below) behavior. Applying your patches also had the same effect because, while I believed I was updating our userspace ipset tools in tandem, I found that the headers were actually being copied in from an alternate location and were still the vanilla headers.
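The fix on the kube-router side amounts to dropping initval during the massaging step so that option strings compare equal across sets. A minimal sketch of that sanitization (function and set names here are illustrative, not kube-router's actual code):

```python
# Sketch: strip the kernel-generated "initval" option from `ipset save`
# output before reusing the option string for new/temporary sets.
# Names are illustrative, not kube-router's actual code.
import re

# `ipset save` emits initval as a hex value on "create" lines (5.11+).
INITVAL_RE = re.compile(r"\s+initval\s+0x[0-9a-f]+", re.IGNORECASE)

def sanitize_create_line(line: str) -> str:
    """Drop initval so option strings compare equal across sets.

    Leaving initval in place both reuses one set's initval for another
    (incorrect, since initval is meant to be unique per set) and makes
    otherwise-identical option strings differ, forcing one temporary
    set per saved set instead of one per distinct option combination.
    """
    if line.startswith("create "):
        return INITVAL_RE.sub("", line)
    return line

saved = ("create KUBE-DST-EXAMPLE hash:ip family inet hashsize 1024 "
         "maxelem 65536 initval 0xdeadbeef timeout 0")
print(sanitize_create_line(saved))
# → create KUBE-DST-EXAMPLE hash:ip family inet hashsize 1024 maxelem 65536 timeout 0
```

With initval removed, two saved sets that differ only in initval now yield the same option string, so a single temporary set can service both swaps and the restore file shrinks accordingly.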
Because the build was still picking up the vanilla headers, userspace recognized the kernel's initval values as IPSET_ATTR_GC values, which were then unused. It was a very long process to arrive at such a simple realization: the ipset save / restore format had changed. I apologize for the noise.

-Aaron

From: U'ren, Aaron <Aaron.U'ren@xxxxxxxx>
Date: Friday, July 8, 2022 at 3:08 PM
To: Jozsef Kadlecsik <kadlec@xxxxxxxxxxxxx>, Jakub Kicinski <kuba@xxxxxxxxxx>
Cc: Thorsten Leemhuis <regressions@xxxxxxxxxxxxx>, McLean, Patrick <Patrick.Mclean@xxxxxxxx>, Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx>, netfilter-devel@xxxxxxxxxxxxxxx <netfilter-devel@xxxxxxxxxxxxxxx>, Brown, Russell <Russell.Brown@xxxxxxxx>, Rueger, Manuel <manuel.rueger@xxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx <linux-kernel@xxxxxxxxxxxxxxx>, regressions@xxxxxxxxxxxxxxx <regressions@xxxxxxxxxxxxxxx>, Florian Westphal <fw@xxxxxxxxx>, netdev@xxxxxxxxxxxxxxx <netdev@xxxxxxxxxxxxxxx>
Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15

Jozsef / Jakub-

Given your latest email and the fact that just adding back in IPSET_ATTR_GC doesn't shed any light on the issue, I wanted to spend a lot more time testing. I also wanted to provide as much context for this issue as possible.

I think that the iptables slowness is just a symptom, not the cause, of the issue. After spending a lot more time with it, I can see that iptables only runs slowly when an existing "ipset restore" process is being run by kube-router simultaneously. Given the other information that you've provided, my hunch is that iptables slows down when ipset restore is running because they are both vying for the same mutex? In any case, I think troubleshooting from the direction of the iptables slowness is likely the wrong path to go down.
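The header mismatch mentioned above comes down to two names occupying the same netlink attribute slot: in pre-5.11 uapi headers the enumerator after IPSET_ATTR_CADT_MAX (16) is the unused IPSET_ATTR_GC, while in 5.11+ headers that same slot is IPSET_ATTR_INITVAL. A small sketch of the collision (values taken from the enum quoted later in this thread):

```python
# Sketch of the uapi attribute collision, with values from the enum in
# include/uapi/linux/netfilter/ipset/ip_set.h (quoted in Jakub's diff below).
from enum import IntEnum

IPSET_ATTR_CADT_MAX = 16  # "Reserve empty slots" boundary in the uapi enum

class OldHeaders(IntEnum):
    # pre-5.11: first create-only attribute after the reserved slots
    IPSET_ATTR_GC = IPSET_ATTR_CADT_MAX + 1        # 17, an unused remnant

class NewHeaders(IntEnum):
    # 5.11+: the same slot, renamed and given a meaning
    IPSET_ATTR_INITVAL = IPSET_ATTR_CADT_MAX + 1   # 17, the hash initval

# Same wire value, different interpretation: a userspace built against the
# old headers parses the kernel's INITVAL payload as the ignored GC attr.
print(int(OldHeaders.IPSET_ATTR_GC), int(NewHeaders.IPSET_ATTR_INITVAL))
# → 17 17
```

This is why rebuilding the userspace tools against the stale headers silently "worked": the attribute parsed cleanly, it was just routed to a name nothing ever reads.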
The true problem seems to be that when IPSET_ATTR_GC is not included, nodes can somehow get into a state where "ipset restore" goes from completing in less than a tenth of a second to taking 30 seconds to a minute. The hard part is that I still don't know what causes a node to enter this state. I have a Kubernetes cluster of about 7 nodes where I can reliably reproduce it, but I have yet to reproduce it consistently anywhere else. Other clusters will randomly exhibit the issue if IPSET_ATTR_GC is left out of the kernel, but not consistently. Since the email where we found the commit about 2 weeks ago, we have also been running 6 clusters of 9+ nodes with IPSET_ATTR_GC enabled and have not had any issues.

Since we have a custom kernel configuration, I have also tried the vanilla Ubuntu kernel configuration (taken from 5.15.0-40-generic) just to ensure that we didn't have some errant configuration option enabled. That configuration also reliably reproduced the issue when IPSET_ATTR_GC was removed, and just as reliably removed the issue when IPSET_ATTR_GC was added back in. I have also verified that neither ipset, iptables, nor any of their dependent libraries reference IPSET_ATTR_GC, going as far as to remove it from the ipset header file (https://git.netfilter.org/iptables/tree/include/linux/netfilter/ipset/ip_set.h#n86) and rebuild everything (the tools and all of their libraries) from scratch just as a Hail Mary. No changes to userspace seem to have any effect on this issue.

One other thing that I've done to help track down the issue is to add a debug option to kube-router so that it outputs the file that it feeds into "ipset restore -exist". With this file, on nodes affected by this issue, I can reliably reproduce the problem by calling "ipset restore -exist <file" and seeing that it takes 30+ seconds to execute.
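The reproduction step above is just "time a restore of the dumped file"; a small wrapper makes it repeatable across nodes. This is a sketch, not kube-router code; the command is parameterized only so the helper can be exercised without root or ipset installed:

```python
# Sketch: time `ipset restore -exist` against a dumped restore file.
# The cmd parameter is an assumption for testability; on a real node
# the default is used and must run as root.
import subprocess
import time

def time_restore(path: str, cmd=("ipset", "restore", "-exist")) -> float:
    """Feed a saved ipset dump into cmd's stdin and return wall-clock seconds.

    On an affected node this was reported at 30+ seconds for kube-router's
    dump; on a healthy node it completes in well under 0.1 s.
    """
    start = time.monotonic()
    with open(path, "rb") as f:
        subprocess.run(cmd, stdin=f, stdout=subprocess.DEVNULL, check=True)
    return time.monotonic() - start
```

Running this in a loop while toggling the IPSET_ATTR_GC change in the kernel is essentially the experiment described in the rest of this email.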
In the hope that it sheds some light and gives you some more context, I'm going to send you and Jakub a copy of the strace output and the ipset file that was used, separately from this email. At this point, I'm not sure how to proceed other than with the files that I'll be sending you. I'm highly confident that the removal of IPSET_ATTR_GC is somehow causing the issues that we see. I have now added and removed the attribute almost 20 times, with reboots across our cluster. Any time it is missing, several nodes exhibit the performance issues immediately. Any time it is present, no nodes exhibit the performance issues.

Looking forward to hearing back from you and getting to the bottom of this very bizarre issue.

-Aaron

From: Jozsef Kadlecsik <kadlec@xxxxxxxxxxxxx>
Date: Saturday, July 2, 2022 at 12:41 PM
To: Jakub Kicinski <kuba@xxxxxxxxxx>
Cc: U'ren, Aaron <Aaron.U'ren@xxxxxxxx>, Thorsten Leemhuis <regressions@xxxxxxxxxxxxx>, McLean, Patrick <Patrick.Mclean@xxxxxxxx>, Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx>, netfilter-devel@xxxxxxxxxxxxxxx <netfilter-devel@xxxxxxxxxxxxxxx>, Brown, Russell <Russell.Brown@xxxxxxxx>, Rueger, Manuel <manuel.rueger@xxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx <linux-kernel@xxxxxxxxxxxxxxx>, regressions@xxxxxxxxxxxxxxx <regressions@xxxxxxxxxxxxxxx>, Florian Westphal <fw@xxxxxxxxx>, netdev@xxxxxxxxxxxxxxx <netdev@xxxxxxxxxxxxxxx>
Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15

Hi,

On Thu, 30 Jun 2022, Jakub Kicinski wrote:

> Sounds like you're pretty close to figuring this out! Can you check
> if the user space is intentionally setting IPSET_ATTR_INITVAL?
> Either that or IPSET_ATTR_GC was not as "unused" as initially thought.

IPSET_ATTR_GC was really unused. It was an old remnant from the time when ipset userspace-kernel communication was through set/getsockopt.
However, when it was migrated to netlink, only the symbol was kept; it was not used by either the userspace tool or the kernel.

Aaron, could you send me how to reproduce the issue? I have no idea how that patch could be the reason: setting/getting/using IPSET_ATTR_INITVAL is totally independent from listing iptables rules. But if you have got a reproducer then I can dig into it.

Best regards,
Jozsef

> Testing something like this could be a useful data point:
>
> diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
> index 6397d75899bc..7caf9b53d2a7 100644
> --- a/include/uapi/linux/netfilter/ipset/ip_set.h
> +++ b/include/uapi/linux/netfilter/ipset/ip_set.h
> @@ -92,7 +92,7 @@ enum {
> 	/* Reserve empty slots */
> 	IPSET_ATTR_CADT_MAX = 16,
> 	/* Create-only specific attributes */
> -	IPSET_ATTR_INITVAL,	/* was unused IPSET_ATTR_GC */
> +	IPSET_ATTR_GC,
> 	IPSET_ATTR_HASHSIZE,
> 	IPSET_ATTR_MAXELEM,
> 	IPSET_ATTR_NETMASK,
> @@ -104,6 +104,8 @@ enum {
> 	IPSET_ATTR_REFERENCES,
> 	IPSET_ATTR_MEMSIZE,
>
> +	IPSET_ATTR_INITVAL,
> +
> 	__IPSET_ATTR_CREATE_MAX,
> };
> #define IPSET_ATTR_CREATE_MAX (__IPSET_ATTR_CREATE_MAX - 1)
>
> On Thu, 30 Jun 2022 14:59:14 +0000 U'ren, Aaron wrote:
> > Thorsten / Jozsef-
> >
> > Thanks for continuing to follow up! I'm sorry that this has moved so slowly; it has taken us a while to find the time to fully track this issue down. However, I think that we have figured out enough to make some more forward progress.
> >
> > Jozsef, thanks for your insight into what is happening between those system calls. In regard to your question about wait/wound mutex debugging possibly being enabled, I can tell you that we definitely don't have that enabled on any of our regular machines.
> > While we were debugging, we did turn on quite a few debug options to help track this issue down, and it is very possible that the strace that started off this email was taken on a machine that had that debug option enabled. Either way, though, the root issue occurs on hosts that definitely do not have wait/wound mutex debugging enabled.
> >
> > The good news is that we finally got one of our development environments into a state where we could reliably reproduce the performance issue across reboots. This was a win because it meant that we were able to do a full bisect of the kernel and could tell relatively quickly whether or not the issue was present in the test kernels.
> >
> > After bisecting for 3 days, I have been able to narrow it down to a single commit: https://urldefense.com/v3/__https:/git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3976ca101990ca11ddf51f38bec7b86c19d0ca6f__;!!JmoZiZGBv3RvKRSx!9YR_bFOCOkQzPaUftFL2NvuKLm8zPa4tQr_DI8CUZEenjK4Rak_OFmUrCpmiNOaUaiueGbgsEqk0IirIc4I$ (netfilter: ipset: Expose the initval hash parameter to userspace)
> >
> > I'm at a bit of a loss as to why this would cause such severe performance regressions, but I've proved it out multiple times now. I've even checked out a fresh version of the 5.15 kernel that we've been deploying, with just this single commit reverted, and found that the performance problems are completely resolved.
> >
> > I'm hoping that Jozsef will have some more insight into why this seemingly innocuous commit causes such large performance issues for us. If you have any additional patches or other things that you would like us to test, I will try to leave our environment in its current state for the next couple of days so that we can do so.
> >
> > -Aaron
> >
> > From: Thorsten Leemhuis <regressions@xxxxxxxxxxxxx>
> > Date: Monday, June 20, 2022 at 2:16 AM
> > To: U'ren, Aaron <Aaron.U'ren@xxxxxxxx>
> > Cc: McLean, Patrick <Patrick.Mclean@xxxxxxxx>, Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx>, netfilter-devel@xxxxxxxxxxxxxxx <netfilter-devel@xxxxxxxxxxxxxxx>, Brown, Russell <Russell.Brown@xxxxxxxx>, Rueger, Manuel <manuel.rueger@xxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx <linux-kernel@xxxxxxxxxxxxxxx>, regressions@xxxxxxxxxxxxxxx <regressions@xxxxxxxxxxxxxxx>, Florian Westphal <fw@xxxxxxxxx>, netdev@xxxxxxxxxxxxxxx <netdev@xxxxxxxxxxxxxxx>, Jozsef Kadlecsik <kadlec@xxxxxxxxxxxxx>
> > Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> >
> > On 31.05.22 09:41, Jozsef Kadlecsik wrote:
> > > On Mon, 30 May 2022, Thorsten Leemhuis wrote:
> > >> On 04.05.22 21:37, U'ren, Aaron wrote:
> > [...]
> > >
> > > Every set lookup behind "iptables" needs two getsockopt() calls: you can
> > > see them in the strace logs. The first one checks the internal protocol
> > > number of ipset and the second one verifies/gets the processed set (it's
> > > an extension to iptables and therefore there's no internal state to save
> > > the protocol version number).
> >
> > Hi Aaron! Did any of the suggestions from Jozsef help to track down the
> > root cause? I have this issue on the list of tracked regressions and
> > wonder what the status is. Or can I mark this as resolved?
> >
> > Side note: this is not a "something breaks" regression and it seems to
> > progress slowly, so I'm putting it on the backburner:
> >
> > #regzbot backburner: performance regression where the culprit is hard to
> > track down
> >
> > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >
> > P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > reports and sometimes miss something important when writing mails like
> > this.
> > If that's the case here, don't hesitate to tell me in a public
> > reply; it's in everyone's interest to set the public record straight.
> >
> > [...]
> > >
> > > In your strace log
> > >
> > > 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> > > 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.1$
> > > 0.109456 close(4) = 0 <0.000022>
> > >
> > > the only things which happen in the second sockopt function are to lock
> > > the NFNL_SUBSYS_IPSET mutex, walk the array of the sets, compare the
> > > set names, save the result in the case of a match, and unlock the mutex.
> > > Nothing complicated, no deep, multi-level function calls. Just a few
> > > lines of code which haven't changed.
> > >
> > > The only thing which can slow down the processing is the mutex handling.
> > > Don't you have wait/wound mutex debugging accidentally enabled in the
> > > kernel? If not, then bisecting the mutex-related patches might help.
> > >
> > > You wrote that flushing tables or ipsets didn't seem to help. That
> > > literally means flushing, i.e. the sets were emptied but not destroyed?
> > > Did you try both destroying and flushing?
> > >
> > >> Jozsef, I still have this issue on my list of tracked regressions and it
> > >> looks like nothing has happened since the above mail (or did I miss it?).
> > >> Could you maybe provide some guidance to Aaron to get us all closer to
> > >> the root of the problem?
> > >
> > > I really hope it's an accidentally enabled debugging option in the kernel.
> > > Otherwise bisecting could help to uncover the issue.
> > >
> > > Best regards,
> > > Jozsef
> > >
> > >> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > >> reports and sometimes miss something important when writing mails like
> > >> this. If that's the case here, don't hesitate to tell me in a public
> > >> reply, it's in everyone's interest to set the public record straight.
> -
> E-mail : kadlec@xxxxxxxxxxxxxxxxx, kadlecsik.jozsef@xxxxxxxxx
> PGP key : https://urldefense.com/v3/__https:/wigner.hu/*kadlec/pgp_public_key.txt__;fg!!JmoZiZGBv3RvKRSx!9YR_bFOCOkQzPaUftFL2NvuKLm8zPa4tQr_DI8CUZEenjK4Rak_OFmUrCpmiNOaUaiueGbgsEqk0Udypzvg$
> Address : Wigner Research Centre for Physics
>           H-1525 Budapest 114, POB. 49, Hungary