Re: [PATCH v2 0/3] livepatch: Add "replace" sysfs attribute

On Fri, Jun 21, 2024 at 6:04 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
>
> On Fri 2024-06-21 16:39:40, Yafang Shao wrote:
> > On Fri, Jun 21, 2024 at 3:31 PM Petr Mladek <pmladek@xxxxxxxx> wrote:
> > >
> > > On Tue 2024-06-11 10:46:47, Yafang Shao wrote:
> > > > On Tue, Jun 11, 2024 at 1:19 AM Song Liu <song@xxxxxxxxxx> wrote:
> > > > >
> > > > > Hi Yafang,
> > > > >
> > > > > On Sun, Jun 9, 2024 at 6:33 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > > > >
> > > > > > Add "replace" sysfs attribute to show whether a livepatch supports
> > > > > > atomic replace or not.
> > > > >
> > > > > I am curious about the use case here.
> > > >
> > > > We will use this flag to check whether both an atomic-replace
> > > > livepatch and a non-atomic-replace livepatch are running on a
> > > > single server at the same time. That can happen if we install a
> > > > new non-atomic-replace livepatch on a server which has already
> > > > applied an atomic-replace livepatch.
> > > >
> > > > > AFAICT, the "replace" flag
> > > > > matters _before_ the livepatch is loaded. Once the livepatch is
> > > > > loaded, the "replace" part is already finished. Therefore, I think
> > > > > we really need a way to check the replace flag before loading the
> > > > > .ko file? Maybe in modinfo?
> > > >
> > > > It would be better if we could check it before loading. However,
> > > > that depends on how the livepatch .ko is built, right? Take
> > > > kpatch-build[0] for example: we would have to modify kpatch-build
> > > > to add support for this, but we cannot ensure that all users will
> > > > use it to build their livepatches.
> > >
> > > > [0]. https://github.com/dynup/kpatch
> > >
> > > I still do not understand how the sysfs attribute would help here.
> > > It will show the type of the currently used livepatches. But
> > > it won't say what the to-be-installed livepatch would do.
> > >
> > > Could you please describe how exactly you would use the information?
> > > What would be the process/algorithm/logic which would prevent a mistake?
> > >
> > > Honestly, it sounds like your processes are broken and you try
> > > to fix them on the wrong end.
> > >
> > > Allowing to load random livepatches which are built a random way
> > > sounds like a hazard.
> >
> > They are not random livepatches. Some system administrators maintain
> > their own livepatches tailored for specific workloads or develop
> > livepatches for A/B testing. Our company’s kernel team maintains the
> > general livepatches. This worked well in the past when all livepatches
> > were independent. However, the situation changed when we switched from
> > non-atomic-replace livepatches to atomic-replace ones, causing
> > issues due to the uncertain behavior of mixing atomic-replace and
> > non-atomic-replace livepatches.
>
> I think that the uncertain behavior existed even before you started
> using the atomic replace.
>
> How did the system administrators detect whether the livepatches were
> independent?
>
> It looks to me that it worked only by luck. Well, I could imagine
> that it has worked for a long time.

We have a limited number of livepatches in our system, primarily
intended to address critical issues that could potentially cause
system instability. Therefore, the likelihood of conflicts between two
livepatches is relatively low. While I acknowledge that there is
indeed a potential risk involved, it is the atomic-replace livepatch
that introduces this risk.

>
>
> > To address this change, we need a robust solution. One approach we
> > have identified is developing a CI build system for livepatches. All
> > livepatches must be built through this CI system, where they will
> > undergo CI tests to verify if they are atomic replace or not.
>
> The important part is that the livepatch authors have to see
> all already existing changes. They need to check that
> the new change would not break other livepatched code and
> vice versa.

Exactly. Through this CI system, all developers have visibility into
all the livepatches currently running on our system.

>
>
> > Additionally, in our production environment, we need to ensure that
> > there are no non atomic replace livepatches in use. For instance, some
> > system administrators might still build livepatches outside of our CI
> > system. Detecting whether a single livepatch is atomic replace or not
> > is not easy. To simplify this, we propose adding a new sysfs attribute
> > to facilitate this check.
> >
> > BTW, perhaps we could introduce a new sysctl setting to forcefully
> > forbid all non atomic replace livepatches?
>
> I like it. This looks like the most reliable solution. Would it
> solve your problem?
>
> Alternative solution would be to forbid installing non-replace
> livepatches when there is already installed a livepatch with
> the atomic replace. I could imagine that we enforce this for
> everyone (without sysctl knob). Would this work for you?

Perhaps we can add this sysctl knob as follows?

kernel.livepatch.forbid_non_atomic_replace:
    0 - Allow non-atomic-replace livepatches. (Default behavior)
    1 - Completely forbid non-atomic-replace livepatches.
    2 - Forbid non-atomic-replace livepatches only if an atomic-replace
        livepatch is already running.
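As a rough sketch only, the decision logic behind such a knob could be modeled as below. The knob name and values are just the proposal above, not an existing interface, and the real check would of course be written in C inside the livepatch core, not in Python:

```python
# Illustrative model of the proposed (hypothetical) sysctl
# kernel.livepatch.forbid_non_atomic_replace. Not kernel code.

ALLOW = 0                      # 0 - allow non-atomic-replace livepatches
FORBID_ALWAYS = 1              # 1 - always forbid them
FORBID_IF_REPLACE_LOADED = 2   # 2 - forbid only if an atomic-replace
                               #     livepatch is already running

def may_load(patch_is_replace, knob, replace_patch_loaded):
    """Return True if loading the new livepatch should be permitted."""
    if patch_is_replace:
        # Atomic-replace livepatches are always allowed; they subsume
        # whatever is currently applied.
        return True
    if knob == FORBID_ALWAYS:
        return False
    if knob == FORBID_IF_REPLACE_LOADED and replace_patch_loaded:
        return False
    return True
```

With knob value 2, a non-atomic-replace livepatch would be rejected only on a system where an atomic-replace livepatch is already applied, which matches the alternative you describe above.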

>
> That said, I am not completely against adding the sysfs attribute
> "replace". I could imagine that it might be useful when a more
> flexible solution is needed. But I think that it would be
> hard to make a more flexible solution safe, especially
> in your environment.

This sysfs attribute remains valuable as it aids in monitoring
livepatches, specifically allowing us to track which type of livepatch
is currently operating on a server.
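For example, assuming the proposed per-patch "replace" attribute exists under /sys/kernel/livepatch/<patch>/, a monitoring check could scan it and flag hosts that mix the two kinds of livepatches. This is only a sketch of how we would consume the attribute; the sysfs root is a parameter so the logic can be exercised outside a live system:

```python
# Sketch: detect mixed atomic-replace / non-atomic-replace livepatches
# via the proposed "replace" sysfs attribute (hypothetical layout).
from pathlib import Path

def scan_livepatches(root="/sys/kernel/livepatch"):
    """Return {patch_name: is_replace} for every loaded livepatch."""
    patches = {}
    for d in Path(root).iterdir():
        attr = d / "replace"
        if attr.is_file():
            # The attribute is assumed to read "1" for atomic replace.
            patches[d.name] = attr.read_text().strip() == "1"
    return patches

def is_mixed(patches):
    """True if both replace and non-replace livepatches are loaded."""
    return set(patches.values()) == {True, False}
```

A monitoring agent could alert whenever is_mixed() returns True, which is exactly the situation we want to detect in production.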

--
Regards
Yafang




