On Tue, Feb 4, 2025 at 9:05 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
>
> On Mon 2025-02-03 17:44:52, Yafang Shao wrote:
> > On Fri, Jan 31, 2025 at 9:18 PM Miroslav Benes <mbenes@xxxxxxx> wrote:
> > >
> > > > > + What exactly is meant by frequent replacements (busy loop?, once a minute?)
> > > >
> > > > The script:
> > > >
> > > > #!/bin/bash
> > > > while true; do
> > > >     yum install -y ./kernel-livepatch-6.1.12-0.x86_64.rpm
> > > >     ./apply_livepatch_61.sh   # it will sleep 5s
> > > >     yum erase -y kernel-livepatch-6.1.12-0.x86_64
> > > >     yum install -y ./kernel-livepatch-6.1.6-0.x86_64.rpm
> > > >     ./apply_livepatch_61.sh   # it will sleep 5s
> > > > done
> > >
> > > A live patch application is a slowpath. It is expected not to run
> > > frequently (in a relative sense).
> >
> > The frequency isn’t the main concern here; _scalability_ is the key issue.
> > Running livepatches once per day (a relatively low frequency) across all of
> > our production servers (hundreds of thousands) isn’t feasible. Instead, we
> > need to periodically run tests on a subset of test servers.
>
> I am confused. The original problem was a system crash when
> livepatching the do_exit() function, see
> https://lore.kernel.org/r/CALOAHbA9WHPjeZKUcUkwULagQjTMfqAdAg+akqPzbZ7Byc=qrw@xxxxxxxxxxxxxx

Why do you view this patchset as a solution to the original problem?

> The rcu watchdog warning was first mentioned in this patchset.
> Do you see the rcu watchdog warning in production, or only
> with this artificial test, please?

So, we shouldn’t run any artificial tests on livepatching, correct?
What exactly is the issue with these test cases?

> > > If you stress it like this, it is quite
> > > expected that it will have an impact. Especially on a large busy system.
> >
> > It seems you agree that the current atomic-replace process lacks scalability.
> > When deploying a livepatch across a large fleet of servers, it’s impossible
> > to ensure that the servers are idle, as their workloads are constantly
> > varying and are not deterministic.
>
> Do you see the scalability problem in production, please?

Yes, the livepatch transition was stalled.

> And could you prove that it was caused by livepatching, please?

When the livepatch transition is stalled, running `kpatch list` will
show the stalled state.

> > The challenges are very different when managing 1K servers versus 1M servers.
> > Similarly, the issues differ significantly between patching a single function
> > and patching 100 functions, especially when some of those functions are
> > critical. That’s what scalability is all about.
> >
> > Since we transitioned from the old livepatch mode to the new
> > atomic-replace mode,
>
> What do you mean with the old livepatch mode, please?

$ kpatch-build -R

That is, building the livepatch with --non-replace, so that a new patch is
stacked on top of the loaded ones instead of atomically replacing them.

> Did you allow to install more livepatches in parallel?

No.

> What was the motivation to switch to the atomic replace, please?

This is the default behavior of kpatch [1] after upgrading to a newer
version of kpatch.

[1]. https://github.com/dynup/kpatch/tree/master

> > our SREs have consistently reported that one or more servers become
> > stalled during the upgrade (replacement).
>
> What is SRE, please?
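
For reference, below is a minimal sketch of how the stalled state mentioned
above can be inspected directly, without kpatch. It is only an illustration
(not from the original discussion) and assumes CONFIG_LIVEPATCH together with
the standard livepatch attributes /sys/kernel/livepatch/<patch>/transition
and /proc/<pid>/patch_state:

#!/bin/bash
# 'transition' reads 1 while a (possibly stalled) transition is in progress.
# During a patching transition, 'patch_state' is 0 for tasks that have not
# been migrated yet; -1 means no transition is pending for that task.
for p in /sys/kernel/livepatch/*/; do
    name=$(basename "$p")
    if [ "$(cat "$p/transition" 2>/dev/null)" = "1" ]; then
        echo "livepatch '$name' is still in transition"
        # List the tasks that are holding the transition back.
        for task in /proc/[0-9]*; do
            state=$(cat "$task/patch_state" 2>/dev/null)
            if [ "$state" = "0" ]; then
                echo "  pending task: pid $(basename "$task") ($(cat "$task/comm" 2>/dev/null))"
            fi
        done
    fi
done

Tasks that keep showing up in that list are typically the ones blocking the
transition, e.g. sleeping with one of the patched functions on their stack.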