Re: [RFC v1 1/3] luo: Live Update Orchestrator

Jason Gunthorpe <jgg@xxxxxxxxxx> · Thu, 20 Mar 2025 11:43:38 -0300

On Thu, Mar 20, 2025 at 02:40:09AM +0000, Pasha Tatashin wrote:
> Introduces the Live Update Orchestrator (LUO), a new kernel subsystem
> designed to facilitate live updates. Live update is a method to reboot
> the kernel while attempting to keep selected devices alive across the
> reboot boundary, minimizing downtime.
> 
> The primary use case is cloud environments, allowing hypervisor updates
> without fully disrupting running virtual machines. VMs can be suspended
> while the hypervisor kernel reboots, and devices attached to these VM
> are kept operational by the LUO.
> 
> Features introduced:
> 
> - Core orchestration logic for managing the live update process.
> - A state machine (NORMAL, PREPARED, UPDATED, *_FAILED) to track
>   the progress of live updates.
> - Notifier chains for subsystems (device layer, interrupts, KVM, IOMMU,
>   etc.) to register callbacks for different live update events:
>     - LIVEUPDATE_PREPARE: Prepare for reboot (before blackout).
>     - LIVEUPDATE_REBOOT: Final serialization before kexec (blackout).
>     - LIVEUPDATE_FINISH: Cleanup after update (after blackout).
>     - LIVEUPDATE_CANCEL: Rollback actions on failure or user request.

I still don't think notifier chains are the right way to go about a lot
of this. Most of it should be driven off of the file descriptors and
fdbox, not through notification.

At the very least we should not be adding notifier chains without a
clear user of them, and I'm not convinced that the iommu driver or
vfio are those users at the moment.

I feel more like the iommu can be brought into the serialization
indirectly by putting an iommufd into a fdbox.
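
To make the idea concrete, here is a rough userspace model, not kernel
code. Every name in it (struct file_model, struct box_model, box_put,
iommufd_model_serialize) is made up for illustration and is not taken
from the fdbox or LUO patches. It only shows the shape: the serialize
hook hangs off the file type and runs when that file is put into the
box, so nothing needs a global notifier chain.

```c
#include <stddef.h>

#define BOX_MAX 4

/* Hypothetical stand-in for a file with a per-type preservation hook. */
struct file_model {
	const char *name;
	/* Called only when this particular file is put into a box. */
	int (*serialize)(struct file_model *f);
	int serialized;
};

/* Hypothetical stand-in for an fdbox-like container. */
struct box_model {
	struct file_model *files[BOX_MAX];
	size_t nr;
};

/* Stand-in for the work an iommufd would do to serialize itself. */
int iommufd_model_serialize(struct file_model *f)
{
	f->serialized = 1;
	return 0;
}

/*
 * Putting a file into the box is what triggers its serialization.
 * Returns 0 on success and -1 if the box is full or the file type
 * has no serialize hook (it does not support preservation).
 */
int box_put(struct box_model *b, struct file_model *f)
{
	int ret;

	if (b->nr == BOX_MAX || !f->serialize)
		return -1;

	ret = f->serialize(f);
	if (ret)
		return ret;

	b->files[b->nr++] = f;
	return 0;
}
```

A file type without a hook is simply refused at put time, which is
where userspace would want to see that error anyway.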

> - A sysfs interface (/sys/kernel/liveupdate/) for user-space control:
>     - `prepare`: Initiate preparation (write 1) or reset (write 0).
>     - `finish`: Finalize update in new kernel (write 1).
>     - `cancel`: Abort ongoing preparation or reboot (write 1).
>     - `reset`: Force state back to normal (write 1).
>     - `state`: Read-only view of the current LUO state.
>     - `enabled`: Read-only view of whether live update is enabled.

I also think we should give up on the sysfs. If fdbox is going forward
in a char dev direction then I think we should have two char devs,
/dev/kho/serialize and /dev/kho/deserialize, and run the whole thing
through that. The concepts shown in the fdbox patches should be merged
into the kho/serialize char dev as one general architecture: open the
char dev, put stuff into it, then finalize and do the kexec.

It gives you more options to avoid things like notifiers and a very
clear "session" linked to a FD lifetime that encloses the
serialization effort. I think that will make error case cleanup easier
and the whole thing more maintainable. IMHO sysfs is not a great API
choice for something so complicated.
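
As a sketch of what I mean by a session tied to the FD lifetime, here
is a small userspace model. It is not a proposed uAPI: struct
session_model, struct item_model and the session_* functions are
hypothetical names, and a real char dev would do this in its open,
ioctl and release handlers. The point is only that closing the session
without finalizing undoes everything that was put into it.

```c
#include <stdbool.h>
#include <stddef.h>

#define SESSION_MAX_ITEMS 8

/* Hypothetical stand-in for something preserved across kexec. */
struct item_model {
	int id;
	bool preserved;	/* true while marked for preservation */
};

/* Hypothetical state behind one open of the serialize char dev. */
struct session_model {
	struct item_model *items[SESSION_MAX_ITEMS];
	size_t nr_items;
	bool finalized;
};

/* Models open(): start an empty session. */
void session_open(struct session_model *s)
{
	s->nr_items = 0;
	s->finalized = false;
}

/* Models "put stuff into it". Fails once finalized or when full. */
int session_put(struct session_model *s, struct item_model *it)
{
	if (s->finalized || s->nr_items == SESSION_MAX_ITEMS)
		return -1;
	it->preserved = true;
	s->items[s->nr_items++] = it;
	return 0;
}

/* Models the finalize step that comes right before kexec. */
int session_finalize(struct session_model *s)
{
	if (s->finalized)
		return -1;
	s->finalized = true;
	return 0;
}

/* Models close(): an unfinalized session is rolled back as a whole. */
void session_close(struct session_model *s)
{
	size_t i;

	if (!s->finalized)
		for (i = 0; i < s->nr_items; i++)
			s->items[i]->preserved = false;
	s->nr_items = 0;
}
```

If the process doing the serialization dies half way, the FD gets
closed and the same rollback path runs, with no separate cancel or
reset knob needed for that case.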

Also agree with Greg, I think this needs more thoughtful patch staging
with actual complete solutions. I would focus on a progression of
demonstrable kexec preservation:
 - A simple KVM and the VM's backing memory in a memfd is preserved
 - A simple vfio-noiommu doing DMA to a preserved memfd, including not
   resetting the device (but with no iommu driver)
 - iommufd

These all build on each other and introduce API along with concrete
and meaningful use cases.

I see a lot of confusion in the various review comments on the KHO work
that I think misunderstand the scope of what would be brought into
this. It is not hundreds of FDs or hundreds of devices, but a very
narrow and selective set that can work like this. Showing each
step along the way would help narrow the thinking.

Jason