Re: [RFC PATCH 2/2] livepatch: Implement livepatch hybrid mode

Petr Mladek <pmladek@xxxxxxxx> · Tue, 4 Feb 2025 14:21:53 +0100

On Mon 2025-01-27 23:34:50, Yafang Shao wrote:
> On Mon, Jan 27, 2025 at 10:31 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
> >
> > On Mon 2025-01-27 14:35:26, Yafang Shao wrote:
> > > The atomic replace livepatch mechanism was introduced to handle scenarios
> > > where we want to unload a specific livepatch without unloading others.
> > > However, its current implementation has significant shortcomings, making
> > > it less than ideal in practice. Below are the key downsides:
> >
> > [...]
> >
> > > In the hybrid mode:
> > >
> > > - Specific livepatches can be marked as "non-replaceable" to ensure they
> > >   remain active and unaffected during replacements.
> > >
> > > - Other livepatches can be marked as "replaceable," allowing targeted
> > >   replacements of only those patches.
> > >
> > > This selective approach would reduce unnecessary transitions, lower the
> > > risk of temporary patch loss, and mitigate performance issues during
> > > livepatch replacement.
> > >
> > > --- a/kernel/livepatch/core.c
> > > +++ b/kernel/livepatch/core.c
> > > @@ -658,6 +658,8 @@ static int klp_add_nops(struct klp_patch *patch)
> > >               klp_for_each_object(old_patch, old_obj) {
> > >                       int err;
> > >
> > > +                     if (!old_patch->replaceable)
> > > +                             continue;
> >
> > This is one example where things might get very complicated.
> 
> Why does this example even exist in the first place?
> If hybrid mode is enabled, this scenario could have been avoided entirely.

How exactly this scenario could be avoided, please?

In the real life, livepatches are used to fix many bugs.
And every bug is fixed by livepatching several functions.

Fix1 livepatches: funcA
Fix2 livepatches: funcA, funcB
Fix3 livepatches: funcB

How would you handle this with the hybrid model?

Which fix will be handled by livepatches installed in parallel?
Which fix will be handled by atomic replace?
How would you keep it consistent?

How would it work when there are hundreds of fixes and thousands
of livepatched functions?

Where exactly is the advantage of the hybrid model?

> >
> > The same function might be livepatched by more livepatches, see
> > ops->func_stack. For example, let's have funcA and three livepatches:
> > a
> >   + lp1:
> >         .replace = false,
> >         .non-replace = true,
> >         .func = {
> >                         .old_name = "funcA",
> >                         .new_func = lp1_funcA,
> >                 }, { }
> >
> >   + lp2:
> >         .replace = false,
> >         .non-replace = false,
> >         .func = {
> >                         .old_name = "funcA",
> >                         .new_func = lp2_funcA,
> >                 },{
> >                         .old_name = "funcB",
> >                         .new_func = lp2_funcB,
> >                 }, { }
> >
> >
> >   + lp3:
> >         .replace = true,
> >         .non-replace = false,
> >         .func = {
> >                         .old_name = "funcB",
> >                         .new_func = lp3_funcB,
> >                 }, { }
> >
> >
> > Now, apply lp1:
> >
> >       + funcA() gets redirected to lp1_funcA()
> >
> > Then, apply lp2
> >
> >       + funcA() gets redirected to lp2_funcA()
> >
> > Finally, apply lp3:
> >
> >       + The proposed code would add "nop()" for
> >         funcA() because it exists in lp2 and does not exist in lp3.
> >
> >       + funcA() will get redirected to the original code
> >         because of the nop() during transition
> >
> >       + nop() will get removed in klp_complete_transition() and
> >         funcA() will get suddenly redirected to lp1_funcA()
> >         because it will still be on ops->func_stack even
> >         after the "nop" and lp2_funcA() gets removed.
> >
> >            => The entire system will start using another funcA
> >               implementation at some random time
> >
> >            => this would violate the consistency model
> >
> >
> > The proper solution might be tricky:
> >
> > 1. We would need to detect this situation and do _not_ add
> >    the "nop" for lp3 and funcA().
> >
> > 2. We would need a more complicate code for handling the task states.
> >
> >    klp_update_patch_state() sets task->patch_state using
> >    the global "klp_target_state". But in the above example,
> >    when enabling lp3:
> >
> >     + funcA would need to get transitioned _backward_:
> >          KLP_TRANSITION_PATCHED -> KLP_TRANSITION_UNPATCHED
> >       , so that it goes on ops->func_stack:
> >          lp2_funcA -> lp1->funcA
> >
> >    while:
> >
> >     + funcA would need to get transitioned forward:
> >          KLP_TRANSITION_UNPATCHED -> KLP_TRANSITION_PATCHED
> >       , so that it goes on ops->func_stack:
> >          lp2_funcB -> lp3->funcB
> >
> >
> > => the hybrid mode would complicate the life for both livepatch
> >    creators/maintainers and kernel code developers/maintainers.
> >
> 
> I don’t believe they should be held responsible for poor
> configurations. These settings could have been avoided from the start.
> There are countless tunable knobs in the system—should we try to
> accommodate every possible combination? No, that’s not how systems are
> designed to operate. Instead, we should identify and follow best
> practices to ensure optimal functionality.

I am sorry but I do not understand the above paragraph at all.

Livepatches modify functions.
How is it related to system configuration or tunable knobs?
What best practices are you talking, please?

Best Regards,
Petr