On Sat, Feb 8, 2025 at 10:49 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> On Fri, Feb 7, 2025 at 7:01 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
> >
> > On Wed 2025-02-05 14:16:42, Yafang Shao wrote:
> > > On Tue, Feb 4, 2025 at 9:05 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
> > > >
> > > > On Mon 2025-02-03 17:44:52, Yafang Shao wrote:
> > > > > On Fri, Jan 31, 2025 at 9:18 PM Miroslav Benes <mbenes@xxxxxxx> wrote:
> > > > > >
> > > > > > > > + What exactly is meant by frequent replacements (busy loop?, once a minute?)
> > > > > > >
> > > > > > > The script:
> > > > > > >
> > > > > > > #!/bin/bash
> > > > > > > while true; do
> > > > > > >     yum install -y ./kernel-livepatch-6.1.12-0.x86_64.rpm
> > > > > > >     ./apply_livepatch_61.sh    # it will sleep 5s
> > > > > > >     yum erase -y kernel-livepatch-6.1.12-0.x86_64
> > > > > > >     yum install -y ./kernel-livepatch-6.1.6-0.x86_64.rpm
> > > > > > >     ./apply_livepatch_61.sh    # it will sleep 5s
> > > > > > > done
> > > > > >
> > > > > > A live patch application is a slowpath. It is expected not to run
> > > > > > frequently (in a relative sense).
> > > > >
> > > > > The frequency isn’t the main concern here; _scalability_ is the key issue.
> > > > > Running livepatches once per day (a relatively low frequency) across all of our
> > > > > production servers (hundreds of thousands) isn’t feasible. Instead, we need to
> > > > > periodically run tests on a subset of test servers.
> > > >
> > > > I am confused. The original problem was a system crash when
> > > > livepatching the do_exit() function, see
> > > > https://lore.kernel.org/r/CALOAHbA9WHPjeZKUcUkwULagQjTMfqAdAg+akqPzbZ7Byc=qrw@xxxxxxxxxxxxxx
> > >
> > > Why do you view this patchset as a solution to the original problem?
> >
> > You discovered the hardlockup warnings when trying to reproduce the
> > original problem. At least, you mentioned this at
> > https://lore.kernel.org/r/20250122085146.41553-1-laoar.shao@xxxxxxxxx
> >
> > And using the hybrid module would allow livepatching do_exit() only
> > once and not touching it any longer. It is not the right solution,
> > but it would be a workaround.
> >
> > > > The rcu watchdog warning was first mentioned in this patchset.
> > > > Do you see the rcu watchdog warning in production or just
> > > > with this artificial test, please?
> > >
> > > So, we shouldn’t run any artificial tests on the livepatch, correct?
> > > What exactly is the issue with these test cases?
> >
> > Some tests might be too artificial. They might find problems which
> > do not exist in practice.
> >
> > Disclaimer: I do not say that this is the case. You actually prove
> > later in this mail that the hardlockup warnings happen
> > even in production.
> >
> > Anyway, if an artificial test finds a potential problem and the fix is
> > complicated, then we need to decide whether it is worth it.
> >
> > It does not make sense to complicate the code too much when
> > the fixed problem does not happen in practice.
> >
> > + Too complicated code might introduce regressions which are
> >   more serious than the original problem.
> >
> > + Too complicated code increases the maintenance cost. It is
> >   more complicated to add new features or fix bugs.
> >
> > > > > > If you stress it like this, it is quite
> > > > > > expected that it will have an impact. Especially on a large busy system.
> > > > >
> > > > > It seems you agree that the current atomic-replace process lacks scalability.
> > > > > When deploying a livepatch across a large fleet of servers, it’s impossible to
> > > > > ensure that the servers are idle, as their workloads are constantly varying and
> > > > > are not deterministic.
> > > >
> > > > Do you see the scalability problem in production, please?
> > >
> > > Yes, the livepatch transition was stalled.
> >
> > Good to know.
> >
> > > > And could you prove that it was caused by livepatching, please?
> > >
> > > When the livepatch transition is stalled, running `kpatch list` will
> > > display the stalled information.
> >
> > OK.
> >
> > > > > The challenges are very different when managing 1K servers versus 1M servers.
> > > > > Similarly, the issues differ significantly between patching a single function
> > > > > and patching 100 functions, especially when some of those functions are critical.
> > > > > That’s what scalability is all about.
> > > > >
> > > > > Since we transitioned from the old livepatch mode to the new
> > > > > atomic-replace mode,
> > > >
> > > > What do you mean by the old livepatch mode, please?
> > >
> > > $ kpatch-build -R
> >
> > I am not familiar with kpatch-build. OK, I see the following at
> > https://github.com/dynup/kpatch/blob/master/kpatch-build/kpatch-build
> >
> > echo " -R, --non-replace Disable replace patch (replace is on by default)" >&2
> >
> > > > Did you allow to install more livepatches in parallel?
> > >
> > > No.
> >
> > I guess that there is a misunderstanding. I am sorry my wording was
> > not clear enough.
> >
> > By "installing" more livepatches in parallel I meant to "have enabled"
> > more livepatches in parallel. It is possible only when you do not
> > use the atomic replace.
> >
> > In other words, if you use "kpatch-build -R" then you could have
> > enabled more livepatches in parallel.
> >
> > > > What was the motivation to switch to the atomic replace, please?
> > >
> > > This is the default behavior of kpatch [1] after upgrading to a new version.
> > >
> > > [1]. https://github.com/dynup/kpatch/tree/master
> >
> > OK. I wonder if the atomic replace simplified some actions for you.
>
> Actually, it simplifies the livepatch deployment since it only involves a single livepatch.
>
> > I ask because the proposed "hybrid" model is very similar to the "old"
> > model which did not use the atomic replace.
>
> It’s essentially a hybrid of the old model and the atomic replace model.
>
> > What are the advantages of the "hybrid" model over the "old" model, please?
>
> - Advantages compared to the old model
>   In the old model, it’s not possible to replace a specific livepatch
>   with a new one. In the hybrid model, however, you can replace specific
>   livepatches as needed.
>
> - Advantages and disadvantages compared to the atomic replace model
>   - Advantage
>     In the atomic replace model, you must replace all old
>     livepatches, regardless of whether they are relevant to the new
>     livepatch. In the hybrid model, you only need to replace the relevant
>     livepatches.
>
>   - Disadvantage
>     In the atomic replace model, you only need to maintain a single
>     livepatch. However, in the hybrid model, you need to manage multiple
>     livepatches.
>
> > > > >
> > > > > our SREs have consistently reported that one or more servers become
> > > > > stalled during the upgrade (replacement).
> > > >
> > > > What is SRE, please?
> > >
> > > From Wikipedia: https://en.wikipedia.org/wiki/Site_reliability_engineering
> >
> > Good to know.
> >
> > > > Could you please show some log from a production system?
> > >
> > > When the SREs initially reported that the livepatch transition was
> > > stalled, I simply advised them to try again. However, after
> > > experiencing these crashes, I dug deeper into the livepatch code and
> > > realized that scalability is a concern. As a result, periodically
> > > replacing an old livepatch triggers RCU warnings on our production
> > > servers.
> > >
> > > [Wed Feb 5 10:56:10 2025] livepatch: enabling patch 'livepatch_61_release6'
> > > [Wed Feb 5 10:56:10 2025] livepatch: 'livepatch_61_release6': starting patching transition
> > > [Wed Feb 5 10:56:24 2025] rcu_tasks_wait_gp: rcu_tasks grace period 1126113 is 10078 jiffies old.
> > > [Wed Feb 5 10:56:38 2025] rcu_tasks_wait_gp: rcu_tasks grace period 1126117 is 10199 jiffies old.
> > > [Wed Feb 5 10:56:52 2025] rcu_tasks_wait_gp: rcu_tasks grace period 1126121 is 10047 jiffies old.
> > > [Wed Feb 5 10:56:57 2025] livepatch: 'livepatch_61_release6': patching complete
> >
> > I guess that this happens primarily because there are many processes
> > running in kernel code:
> >
> > + many processes => long task list
> > + in kernel code => need to check stack
> >
> > I wondered how much it is caused by livepatching do_exit(), which
> > takes tasklist_lock. The idea was:
>
> I’ll drop the changes in do_exit() and run the test again.

The RCU warning is still triggered even when do_exit() is not livepatched,
but the frequency of occurrence decreases slightly. However, with the
following change, the RCU warning ceases to appear, even when the do_exit()
change is included and the test is run at a high frequency (once every
5 seconds):

diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index 30187b1..c436aa6 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -400,11 +400,22 @@ void klp_try_complete_transition(void)
          * unless the patch includes changes to a very common function.
          */
         read_lock(&tasklist_lock);
-        for_each_process_thread(g, task)
+        for_each_process_thread(g, task) {
+                if (task->patch_state == klp_target_state)
+                        continue;
                 if (!klp_try_switch_task(task))
                         complete = false;
+
+                if (need_resched()) {
+                        complete = false;
+                        break;
+                }
+        }
         read_unlock(&tasklist_lock);
 
+        /* The above operation might be expensive. */
+        cond_resched();
+
         /*
          * Ditto for the idle "swapper" tasks.
          */

--
Regards
Yafang
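
For readers following the hybrid-mode discussion above, below is a minimal
sketch of how non-replace livepatches can be stacked and inspected through
the sysfs interface described in Documentation/livepatch/livepatch.rst. The
module names are hypothetical, and the patches are assumed to have been built
without atomic replace (e.g. with "kpatch-build -R" as quoted earlier); this
is only an illustration of the workflow, not part of the proposed patch.

# Load two non-replace livepatch modules; each is enabled when loaded
# and both remain active at the same time (no atomic replace).
insmod livepatch_fix_a.ko
insmod livepatch_fix_b.ko

# Each patch exposes a transition flag: 1 while the transition is still
# in progress, 0 once it has completed. A patch stuck at 1 is roughly
# the stalled state that `kpatch list` reports above.
grep . /sys/kernel/livepatch/*/transition

# Disable and unload only one of the stacked patches; the other one
# keeps running untouched.
echo 0 > /sys/kernel/livepatch/livepatch_fix_b/enabled
rmmod livepatch_fix_b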