On Tue, Feb 4, 2025 at 9:05 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
>
> On Mon 2025-02-03 17:44:52, Yafang Shao wrote:
> > On Fri, Jan 31, 2025 at 9:18 PM Miroslav Benes <mbenes@xxxxxxx> wrote:
> > >
> > > > > + What exactly is meant by frequent replacements (busy loop?, once a minute?)
> > > >
> > > > The script:
> > > >
> > > > #!/bin/bash
> > > > while true; do
> > > >     yum install -y ./kernel-livepatch-6.1.12-0.x86_64.rpm
> > > >     ./apply_livepatch_61.sh   # it will sleep 5s
> > > >     yum erase -y kernel-livepatch-6.1.12-0.x86_64
> > > >     yum install -y ./kernel-livepatch-6.1.6-0.x86_64.rpm
> > > >     ./apply_livepatch_61.sh   # it will sleep 5s
> > > > done
> > >
> > > A live patch application is a slowpath. It is expected not to run
> > > frequently (in a relative sense).
> >
> > The frequency isn’t the main concern here; _scalability_ is the key issue.
> > Running livepatches once per day (a relatively low frequency) across all of
> > our production servers (hundreds of thousands) isn’t feasible. Instead, we
> > need to periodically run tests on a subset of test servers.
>
> I am confused. The original problem was a system crash when
> livepatching the do_exit() function, see
> https://lore.kernel.org/r/CALOAHbA9WHPjeZKUcUkwULagQjTMfqAdAg+akqPzbZ7Byc=qrw@xxxxxxxxxxxxxx

Why do you view this patchset as a solution to the original problem?

> The rcu watchdog warning was first mentioned in this patchset.
> Do you see the rcu watchdog warning in production, or only
> with this artificial test, please?

So, we shouldn’t run any artificial tests on livepatching, correct?
What exactly is the issue with these test cases?

> > > If you stress it like this, it is quite
> > > expected that it will have an impact. Especially on a large busy system.
> >
> > It seems you agree that the current atomic-replace process lacks scalability.
> > When deploying a livepatch across a large fleet of servers, it’s impossible
> > to ensure that the servers are idle, as their workloads are constantly
> > varying and are not deterministic.
>
> Do you see the scalability problem in production, please?

Yes, the livepatch transition was stalled.

> And could you prove that it was caused by livepatching, please?

When the livepatch transition is stalled, running `kpatch list` will
show the stalled state.

> > The challenges are very different when managing 1K servers versus 1M servers.
> > Similarly, the issues differ significantly between patching a single function
> > and patching 100 functions, especially when some of those functions are
> > critical. That’s what scalability is all about.
> >
> > Since we transitioned from the old livepatch mode to the new
> > atomic-replace mode,
>
> What do you mean with the old livepatch mode, please?

$ kpatch-build -R

That is, building the livepatch with --non-replace, so that a new patch is
stacked on top of the loaded ones instead of atomically replacing them.

> Did you allow to install more livepatches in parallel?

No.

> What was the motivation to switch to the atomic replace, please?

This is the default behavior of kpatch [1] after upgrading to a newer
version of kpatch.

[1]. https://github.com/dynup/kpatch/tree/master

> > our SREs have consistently reported that one or more servers become
> > stalled during the upgrade (replacement).
>
> What is SRE, please?
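
For reference, below is a minimal sketch of how the stalled state mentioned
above can be inspected directly, without kpatch. It is only an illustration
(not from the original discussion) and assumes CONFIG_LIVEPATCH together with
the standard livepatch attributes /sys/kernel/livepatch/<patch>/transition
and /proc/<pid>/patch_state:

#!/bin/bash
# 'transition' reads 1 while a (possibly stalled) transition is in progress.
# During a patching transition, 'patch_state' is 0 for tasks that have not
# been migrated yet; -1 means no transition is pending for that task.
for p in /sys/kernel/livepatch/*/; do
    name=$(basename "$p")
    if [ "$(cat "$p/transition" 2>/dev/null)" = "1" ]; then
        echo "livepatch '$name' is still in transition"
        # List the tasks that are holding the transition back.
        for task in /proc/[0-9]*; do
            state=$(cat "$task/patch_state" 2>/dev/null)
            if [ "$state" = "0" ]; then
                echo "  pending task: pid $(basename "$task") ($(cat "$task/comm" 2>/dev/null))"
            fi
        done
    fi
done

Tasks that keep showing up in that list are typically the ones blocking the
transition, e.g. sleeping with one of the patched functions on their stack.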