On Thu 2023-01-26 15:12:35, Seth Forshee (DigitalOcean) wrote:
> On Thu, Jan 26, 2023 at 06:03:16PM +0100, Petr Mladek wrote:
> > On Fri 2023-01-20 16:12:20, Seth Forshee (DigitalOcean) wrote:
> > > We've fairly regularly seen livepatches which cannot transition within kpatch's
> > > timeout period due to busy vhost worker kthreads.
> >
> > I have missed this detail. Miroslav told me that we have solved
> > something similar some time ago, see
> > https://lore.kernel.org/all/20220507174628.2086373-1-song@xxxxxxxxxx/
>
> Interesting thread. I had thought about something along the lines of the
> original patch, but there are some ideas in there that I hadn't
> considered.

Could you please provide some more details about the test system?
Is there anything important to make it reproducible?

The following aspects come to my mind. It might require:

   + more workers running on the same system
   + a dedicated CPU for the worker
   + livepatching the function called by work->fn()
   + running the same work again and again
   + a huge and overloaded system

> > Honestly, kpatch's timeout of 1 minute looks incredibly low to me. Note
> > that the transition is tried only once per second. It means that there
> > are "only" 60 attempts.
> >
> > Just by chance, does it help you to increase the timeout, please?
>
> To be honest my test setup reproduces the problem well enough to make
> KLP wait significant time due to vhost threads, but it seldom causes it
> to hit kpatch's timeout.
>
> Our system management software will try to load a patch tens of times in
> a day, and we've seen real-world cases where patches couldn't load
> within kpatch's timeout for multiple days. But I don't have such an
> environment readily accessible for my own testing. I can try to refine
> my test case and see if I can get it to that point.

My understanding is that you try to load the patch repeatedly but it
always fails after the 1 minute timeout.
It means that it always starts from the beginning (no livepatched process).

Is there any chance to try it with a longer timeout, for example, one
hour? It should increase the chance of success if there are more
problematic kthreads.

> > This low timeout might be useful for testing. But in practice, it does
> > not matter when the transition is lasting one hour or even longer.
> > It takes much longer time to prepare the livepatch.
>
> Agreed. And to be clear, we cope with the fact that patches may take
> hours or even days to get applied in some cases. The patches I sent are
> just about improving the only case I've identified which has led to
> kpatch failing to load a patch for a day or longer.

If it is acceptable to wait hours or even days then the 1 minute
timeout is quite counterproductive. We actually do not use any timeout
at all in livepatches provided by SUSE.

Best Regards,
Petr