On Sat, Feb 8, 2025 at 10:49 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> On Fri, Feb 7, 2025 at 7:01 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
> >
> > On Wed 2025-02-05 14:16:42, Yafang Shao wrote:
> > > On Tue, Feb 4, 2025 at 9:05 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
> > > >
> > > > On Mon 2025-02-03 17:44:52, Yafang Shao wrote:
> > > > > On Fri, Jan 31, 2025 at 9:18 PM Miroslav Benes <mbenes@xxxxxxx> wrote:
> > > > > >
> > > > > > > > + What exactly is meant by frequent replacements (busy loop?, once a minute?)
> > > > > > >
> > > > > > > The script:
> > > > > > >
> > > > > > > #!/bin/bash
> > > > > > > while true; do
> > > > > > >     yum install -y ./kernel-livepatch-6.1.12-0.x86_64.rpm
> > > > > > >     ./apply_livepatch_61.sh    # it will sleep 5s
> > > > > > >     yum erase -y kernel-livepatch-6.1.12-0.x86_64
> > > > > > >     yum install -y ./kernel-livepatch-6.1.6-0.x86_64.rpm
> > > > > > >     ./apply_livepatch_61.sh    # it will sleep 5s
> > > > > > > done
> > > > > >
> > > > > > A live patch application is a slowpath. It is expected not to run
> > > > > > frequently (in a relative sense).
> > > > >
> > > > > The frequency isn’t the main concern here; _scalability_ is the key issue.
> > > > > Running livepatches once per day (a relatively low frequency) across all of our
> > > > > production servers (hundreds of thousands) isn’t feasible. Instead, we need to
> > > > > periodically run tests on a subset of test servers.
> > > >
> > > > I am confused. The original problem was a system crash when
> > > > livepatching the do_exit() function, see
> > > > https://lore.kernel.org/r/CALOAHbA9WHPjeZKUcUkwULagQjTMfqAdAg+akqPzbZ7Byc=qrw@xxxxxxxxxxxxxx
> > >
> > > Why do you view this patchset as a solution to the original problem?
> >
> > You discovered the hardlockup warnings when trying to reproduce the
> > original problem. At least, you mentioned this at
> > https://lore.kernel.org/r/20250122085146.41553-1-laoar.shao@xxxxxxxxx
> >
> > And using the hybrid module would allow livepatching do_exit() only
> > once and not touching it any longer. It is not the right solution,
> > but it would be a workaround.
> >
> > > > The rcu watchdog warning was first mentioned in this patchset.
> > > > Do you see the rcu watchdog warning in production or just
> > > > with this artificial test, please?
> > >
> > > So, we shouldn’t run any artificial tests on the livepatch, correct?
> > > What exactly is the issue with these test cases?
> >
> > Some tests might be too artificial. They might find problems which
> > do not exist in practice.
> >
> > Disclaimer: I do not say that this is the case. You actually prove
> > later in this mail that the hardlockup warnings happen
> > even in production.
> >
> > Anyway, if an artificial test finds a potential problem and the fix is
> > complicated, then we need to decide whether it is worth it.
> >
> > It does not make sense to complicate the code too much when
> > the fixed problem does not happen in practice.
> >
> > + Too complicated code might introduce regressions which are
> >   more serious than the original problem.
> >
> > + Too complicated code increases the maintenance cost. It is
> >   more complicated to add new features or fix bugs.
> >
> > > > > > If you stress it like this, it is quite
> > > > > > expected that it will have an impact. Especially on a large busy system.
> > > > >
> > > > > It seems you agree that the current atomic-replace process lacks scalability.
> > > > > When deploying a livepatch across a large fleet of servers, it’s impossible to
> > > > > ensure that the servers are idle, as their workloads are constantly varying and
> > > > > are not deterministic.
> > > >
> > > > Do you see the scalability problem in production, please?
> > >
> > > Yes, the livepatch transition was stalled.
> >
> > Good to know.
> >
> > > > And could you prove that it was caused by livepatching, please?
> > >
> > > When the livepatch transition is stalled, running `kpatch list` will
> > > display the stalled information.
> >
> > OK.
> >
> > > > > The challenges are very different when managing 1K servers versus 1M servers.
> > > > > Similarly, the issues differ significantly between patching a single function
> > > > > and patching 100 functions, especially when some of those functions are critical.
> > > > > That’s what scalability is all about.
> > > > >
> > > > > Since we transitioned from the old livepatch mode to the new
> > > > > atomic-replace mode,
> > > >
> > > > What do you mean by the old livepatch mode, please?
> > >
> > > $ kpatch-build -R
> >
> > I am not familiar with kpatch-build. OK, I see the following at
> > https://github.com/dynup/kpatch/blob/master/kpatch-build/kpatch-build
> >
> > echo " -R, --non-replace Disable replace patch (replace is on by default)" >&2
> >
> > > > Did you allow to install more livepatches in parallel?
> > >
> > > No.
> >
> > I guess that there is a misunderstanding. I am sorry my wording was
> > not clear enough.
> >
> > By "installing" more livepatches in parallel I meant to "have enabled"
> > more livepatches in parallel. It is possible only when you do not
> > use the atomic replace.
> >
> > In other words, if you use "kpatch-build -R" then you could have
> > enabled more livepatches in parallel.
> >
> > > > What was the motivation to switch to the atomic replace, please?
> > >
> > > This is the default behavior of kpatch [1] after upgrading to a new version.
> > >
> > > [1]. https://github.com/dynup/kpatch/tree/master
> >
> > OK. I wonder if the atomic replace simplified some actions for you.
>
> Actually, it simplifies the livepatch deployment since it only involves a single livepatch.
>
> > I ask because the proposed "hybrid" model is very similar to the "old"
> > model which did not use the atomic replace.
>
> It’s essentially a hybrid of the old model and the atomic replace model.
>
> > What are the advantages of the "hybrid" model over the "old" model, please?
>
> - Advantages compared to the old model
>   In the old model, it’s not possible to replace a specific livepatch
>   with a new one. In the hybrid model, however, you can replace specific
>   livepatches as needed.
>
> - Advantages and disadvantages compared to the atomic replace model
>   - Advantage
>     In the atomic replace model, you must replace all old
>     livepatches, regardless of whether they are relevant to the new
>     livepatch. In the hybrid model, you only need to replace the relevant
>     livepatches.
>
>   - Disadvantage
>     In the atomic replace model, you only need to maintain a single
>     livepatch. However, in the hybrid model, you need to manage multiple
>     livepatches.
>
> > > > >
> > > > > our SREs have consistently reported that one or more servers become
> > > > > stalled during the upgrade (replacement).
> > > >
> > > > What is SRE, please?
> > >
> > > From Wikipedia: https://en.wikipedia.org/wiki/Site_reliability_engineering
> >
> > Good to know.
> >
> > > > Could you please show some log from a production system?
> > >
> > > When the SREs initially reported that the livepatch transition was
> > > stalled, I simply advised them to try again. However, after
> > > experiencing these crashes, I dug deeper into the livepatch code and
> > > realized that scalability is a concern. As a result, periodically
> > > replacing an old livepatch triggers RCU warnings on our production
> > > servers.
> > >
> > > [Wed Feb 5 10:56:10 2025] livepatch: enabling patch 'livepatch_61_release6'
> > > [Wed Feb 5 10:56:10 2025] livepatch: 'livepatch_61_release6': starting patching transition
> > > [Wed Feb 5 10:56:24 2025] rcu_tasks_wait_gp: rcu_tasks grace period 1126113 is 10078 jiffies old.
> > > [Wed Feb 5 10:56:38 2025] rcu_tasks_wait_gp: rcu_tasks grace period 1126117 is 10199 jiffies old.
> > > [Wed Feb 5 10:56:52 2025] rcu_tasks_wait_gp: rcu_tasks grace period 1126121 is 10047 jiffies old.
> > > [Wed Feb 5 10:56:57 2025] livepatch: 'livepatch_61_release6': patching complete
> >
> > I guess that this happens primarily because there are many processes
> > running in kernel code:
> >
> > + many processes => long task list
> > + in kernel code => need to check stack
> >
> > I wondered how much it is caused by livepatching do_exit(), which
> > takes tasklist_lock. The idea was:
>
> I’ll drop the changes in do_exit() and run the test again.

The RCU warning is still triggered even when do_exit() is not livepatched,
but the frequency of occurrence decreases slightly. However, with the
following change, the RCU warning ceases to appear, even when the do_exit()
change is included and the test is run at a high frequency (once every
5 seconds):

diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index 30187b1..c436aa6 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -400,11 +400,22 @@ void klp_try_complete_transition(void)
          * unless the patch includes changes to a very common function.
          */
         read_lock(&tasklist_lock);
-        for_each_process_thread(g, task)
+        for_each_process_thread(g, task) {
+                if (task->patch_state == klp_target_state)
+                        continue;
                 if (!klp_try_switch_task(task))
                         complete = false;
+
+                if (need_resched()) {
+                        complete = false;
+                        break;
+                }
+        }
         read_unlock(&tasklist_lock);
 
+        /* The above operation might be expensive. */
+        cond_resched();
+
         /*
          * Ditto for the idle "swapper" tasks.
          */

--
Regards
Yafang
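
For readers following the hybrid-mode discussion above, below is a minimal
sketch of how non-replace livepatches can be stacked and inspected through
the sysfs interface described in Documentation/livepatch/livepatch.rst. The
module names are hypothetical, and the patches are assumed to have been built
without atomic replace (e.g. with "kpatch-build -R" as quoted earlier); this
is only an illustration of the workflow, not part of the proposed patch.

# Load two non-replace livepatch modules; each is enabled when loaded
# and both remain active at the same time (no atomic replace).
insmod livepatch_fix_a.ko
insmod livepatch_fix_b.ko

# Each patch exposes a transition flag: 1 while the transition is still
# in progress, 0 once it has completed. A patch stuck at 1 is roughly
# the stalled state that `kpatch list` reports above.
grep . /sys/kernel/livepatch/*/transition

# Disable and unload only one of the stacked patches; the other one
# keeps running untouched.
echo 0 > /sys/kernel/livepatch/livepatch_fix_b/enabled
rmmod livepatch_fix_b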