On Mon, Jan 20, 2025 at 2:42 PM David Rientjes <rientjes@xxxxxxxxxx> wrote:
>
> On Mon, 20 Jan 2025, Jason Gunthorpe wrote:
>
> > On Mon, Jan 20, 2025 at 09:54:15AM +0200, Mike Rapoport wrote:
> > > Hi,
> > >
> > > I'd like to discuss memory persistence across kexec.
> > >
> > > Currently there is ongoing work on Kexec HandOver (KHO) [1] that allows
> > > serialization and deserialization of kernel data as well as preserving
> > > arbitrary memory ranges across kexec.
> > >
> > > In addition, KHO keeps physically contiguous memory regions that are
> > > guaranteed to not have any memory that KHO would preserve, but still can be
> > > used by the system. The kexeced kernel bootstraps itself using those
> > > regions and sets all handed over memory as in use. KHO users then can
> > > recover their state from the preserved data. This includes memory
> > > reservations, where the user can either discard or claim reservations.
> > >
> > > KHO can be used as the base layer for implementation of persistence-aware
> > > memory allocator and persistent in-memory filesystem.
> > >
> > > Aside from status update on KHO progress there are a few topics that I would
> > > like to discuss:
> > > * Is it feasible and desirable to enable KHO support in tmpfs and hugetlbfs?
>
> This is a very timely discussion since the last Linux MM Alignment Session
> on the topic since some use cases, at least for tmpfs, have emerged. Not
> necessarily a requirement, but more out of convenience.
>
> > > * Or is it better to implement yet another in-memory filesystem dedicated
> > > for persistence?
> > > * What is the best way to ensure that the memory we want to persist is not
> > > scattered all over the place?
> >
> > There is a lot of talk about taking *drivers* and having them survive
> > kexec, meaning the driver has to put a lot of its state into KHO and
> > then get it back out again.
> >
> > I've been hoping for a model where a driver can be told to "go to KHO"
> > and the KHO code can be largely contained in the driver and regulated
> > to recording the driver state. This implies the state may be
> > fragmented all over memory.
> >
>
> This sounds fantastic if it is doable!
>
> > The other direction is that the driver has to start up in some special
> > KHO mode and KHO becomes invasive on all driver paths to use special
> > KHO allocations. This seems like a PITA.
> >
> > You can see this difference just in the discussion around the iommu
> > serialization where one idea was to have KHO be an integral (and
> > invasive!) part of the page table operations from time zero vs some
> > later serialization at kexec time.
> >
> > Regardless, I'm interested in this discussion to bring some
> > concreteness about how drivers work..
> >
>
> +1, I'm also interested in this discussion.
>
> As previously mentioned[1], we'll also start a biweekly on hypervisor live
> update to accelerate progress. The first instance of that meeting will be
> next week, Monday, January 27 at 8am PST (UTC-8). Calendar invites will
> go out later today for everybody on that email thread, if anybody else is
> interested in attending on a regular basis please email me. Hoping this
> can be leveraged as well to build up to LSF/MM/BPF.
>
> [1]
> https://lore.kernel.org/kexec/2908e4ab-abc4-ddd0-b191-fe820856cfb4@xxxxxxxxxx/T/#u

+1

Hi Mike,

I'm very interested in this topic and can contribute both by presenting
and by implementing changes upstream. We're planning on using KHO in our
kernel at Google, but there are some limitations for our use case that I
believe can be addressed.

Limitations:

1. Serialization callbacks are called by KHO in series during the
   activation phase. In most cases different device drivers are
   independent, so the serialization can be parallelized.

2. Once the serialization callbacks are done, the device tree data
   cannot be altered and drivers cannot add more data into the device
   tree (except for limited modification, where drivers can remember the
   exact node that was created and modify some properties, but that is
   too limited). This is bad because we have use cases where we need to
   save buffered data (not memory locations) into the device tree at
   some late stage before jumping to the new kernel.

3. KHO requires devices to be serialized before
   kexec_file_load()/kexec_load(), which means that the load becomes
   part of the VM blackout window. If KHO is used for hypervisor live
   update scenarios, this is a very bad limitation.

4. KHO activation should not really be needed. There should be two
   phases: the old KHO tree passed from the old kernel and, once it is
   fully consumed, a new KHO tree that devices can update at any time
   and that is going to be passed to the next kernel during the next
   reboot (kexec or firmware that is aware of KHO...). Instead of
   activation there should be a user-driven phase shift from the old
   tree to the new tree; once that is done, drivers can start
   serializing at will.

I believe all these limitations exist because KHO uses device trees in
FDT format during serialization. Instead, the KHO device tree in the new
kernel should be kept in a more relaxed format, such as a hash table,
until it is converted into FDT very late in the kernel shutdown path,
right before jumping to the next kernel.

As David mentioned, there is going to be a hypervisor live update
bi-weekly meeting, where we can discuss this.

Pasha