Hi all,

We had a great discussion today led by Mike Rapoport and James Gowans in the Linux MM Alignment Session, thank you both! Thanks to everybody who attended and provided great questions, suggestions, and feedback.

Kexec HandOver (KHO)[1] is a proposal that enables compatible drivers to hand over their state to the kexec kernel. It is mainly geared toward enabling live updates for cloud providers, so that the host kernel can be updated without disruption to virtual machines; additional use cases have also been discussed, such as persistence for database workloads.

----->o-----

Mike noted there have been several memory persistence proposals over the past ten years: PRAM[2], PKRAM[3], persistent memory pools[4], prmem[5], and now KHO + guestmemfs[6]. The goal for KHO is to provide a framework for drivers to hook into and describe their state, while preserving arbitrary non-GFP_MOVABLE memory pages. On x86, the DT is not required to boot; it is passed through once its data is filled in during the kexec. Kirill Shutemov noted that EFI configuration tables have been used similarly for unaccepted memory, to pass a bitmap of which memory has already been accepted.

Mike described scratch memory as a CMA range reserved by the first kernel and leveraged for GFP_MOVABLE allocations during that kernel's lifetime; when kexec-ing into the new kernel, that kernel uses the scratch region as its only available memory early in boot. This guarantees that memory we want to persist across kexec remains outside of the first kernel's usage. James thinks of this as a chunk of ephemeral memory: memory that will be blown away during kexec to be used as early memory.

----->o-----

Pasha Tatashin asked which attributes affect reboot performance in terms of being proportional to the amount of memory preserved; for example, one large preserved region would be fast because there is only one thing we'd need to reserve.
Mike noted that we want to use the scratch area all the way up until the buddy allocator is active. Preserved pages are not pushed onto the buddy allocator freelists. Pasha noted that one large area of persistence may be faster, and suggested it would be up to the device to carve this memory out. Mike noted there were earlier proposals for this that were not well received; it could be an extension later, but not as part of KHO.

----->o-----

Pasha also asked whether persistence being specified at page allocation time was still being considered for the design. Mike believed this to be outside the scope of KHO. James chimed in that any unmovable allocation could potentially be in scope for preservation, and persistent memory pools[4] could be built on top of KHO: callers could use these pools instead of general kernel memory allocations. Pasha noted this may be useful for some device drivers, although James noted that changing allocations to do this was decided not to be a hard requirement.

Pasha asked whether the DT could be recreated before the reboot or only at serialization time. Mike noted the DT is currently recreated on kexec load, not kexec execute, which was something still to address. Pasha noted that removing kexec load from the critical path would very much be needed. James clarified that this was not exactly kexec load, but rather KHO's own activate phase, which is driven by userspace right before the actual kexec. Mike agreed that this should be decoupled from kexec load and that we should be able to create the DT long before kexec execute.

----->o-----

Kirill asked if this use case would differ at all from the crash kernel use case. Pasha noted that we could load the new kernel in the same way that the crash kernel is loaded today and then reboot when needed. Pasha noted this would be useful for storing state even before the shutdown, for things like the kernel logs of the original kernel.
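As a rough mental model of the persistent memory pool idea discussed above, the sketch below implements a bump allocator over one contiguous arena that a framework like KHO would preserve wholesale. This is a userspace illustration under assumed semantics; the structure, function names, and layout are hypothetical, not a proposed kernel API.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: a persistent pool is a single contiguous region
 * that the handover framework would preserve as one unit.  Callers that
 * need state to survive kexec allocate from the pool instead of the
 * general-purpose allocators, so the handover metadata only needs to
 * describe the pool's base, size, and bump offset.
 */
struct persistent_pool {
	uint8_t *base;	/* start of the preserved region */
	size_t	 size;	/* total bytes in the region */
	size_t	 used;	/* bump offset; part of the preserved header */
};

static void pool_init(struct persistent_pool *p, void *mem, size_t size)
{
	p->base = mem;
	p->size = size;
	p->used = 0;
}

/*
 * Bump allocation with power-of-two alignment; returns NULL when the
 * pool is exhausted rather than falling back to ordinary memory.
 */
static void *pool_alloc(struct persistent_pool *p, size_t bytes, size_t align)
{
	size_t off = (p->used + align - 1) & ~(align - 1);

	if (off + bytes > p->size)
		return NULL;
	p->used = off + bytes;
	return p->base + off;
}
```

The point of the single-arena design, per the discussion, is that one large preserved region is cheap to hand over, and callers opt in explicitly rather than every allocation site changing.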
----->o-----

Mike noted that the CMA area intended to be repurposed as the scratch area would be available at the same size for any future kexec as well; IOW, kernel n+1 would have the same size of scratch as kernel n. The CMA area only satisfies movable allocations, which guarantees that no persistent memory is allocated from the scratch space. Pratyush Yadav noted that if we run out of scratch memory, we could extend the scratch area with ranges of memory that are currently unused, but such ranges are not guaranteed to be available. Mike acknowledged that running out of memory could become a problem, since reclaiming other memory and creating contiguous ranges out of it would only be best effort.

I asked if this memory is carved out by the kernel command line and needs to be consistent between the original kernel and the kexec kernel. Mike clarified that the CMA area is currently sized automatically based on the number of memblock allocations in the first kernel. There was debate about whether this sizing should be automatically discovered or explicitly declared on the kernel command line, like crash kernel use cases. This memory is not throwaway memory: the original kernel can use it for movable allocations at runtime before kexec.

----->o-----

Junaid Shahid asked about preserving any non-movable allocation; the concern was that the kexec kernel would be doing non-movable buddy allocations from the CMA region, and if we preserved those we would have less scratch area available for the following kernel after that. Mike clarified the assumptions: we will never preserve movable allocations, CMA does not allow non-movable allocations from its areas, and allocations made from the scratch area are not preserved.

----->o-----

Pratyush noted he has been working on supporting persistent tmpfs on top of KHO, using its own structure that has a "stable" version.
Kirill expressed the concern that we need to be careful because it becomes part of the kernel ABI. Pratyush, would it be possible to share the example code you were referring to in chat today in response to this thread?

----->o-----

The question came up whether we should port hugetlbfs to use KHO and whether this would be useful for database workloads. Matthew Wilcox noted hugetlbfs is used as a replacement for anonymous memory by the database, effectively as a way to share memory that would otherwise be anonymous across multiple processes (it's the database's page cache). Matthew was unsure whether this would need to be persistent across kexec. Mike noted another example would be virtual machines that use hugetlbfs instead of VM_PFNMAP memory, and he has been thinking about how to do this by serializing the superblock and inodes. Matthew reiterated what he has said before at conferences, which is that we need to get filesystem experts involved when designing filesystems :)

----->o-----

James observed that we should continue the discussion about when serialization happens: when we freeze things, what it means to become immutable, and at what point in the kexec process we do this.
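Since several of the points above revolve around the handover DT and what gets serialized into it, a fragment may help picture the idea: each participating driver describes the physical ranges and state it preserved, and the next kernel's driver reads them back. Every node and property name below is a hypothetical illustration, not the actual KHO schema.

```dts
/* Purely illustrative -- not the real KHO devicetree layout. */
/ {
	compatible = "kho,handover-v1";	/* assumed name */

	preserved-state {
		example-driver {
			compatible = "example,driver-state";	/* assumed */
			/* physical range preserved by the first kernel
			 * (address/size cells are illustrative values) */
			mem = <0x0 0x80000000 0x0 0x200000>;
		};
	};
};
```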
----->o-----

My takeaways:

- we likely want to consider reboot performance as a key input into design decisions
- we should determine whether persistent memory pools, used for allocations that require persistence across kexec, will be out of scope for the initial landing of KHO upstream
  + if out of scope for the initial landing of KHO, we could discuss a layer on top of KHO that could be used for allocations that want to indicate their need for persistence at allocation time
- we need to decide if DT recreation can happen as part of kexec execute instead of kexec load, and how to avoid having kexec load on the critical path; Mike agreed this should be decoupled from kexec load
- we should explore loading the kexec kernel into memory whenever we want, like for crash kernels, which has some advantages noted above
- we need to decide if the sizing of the scratch area should be done automatically based on memblock allocations or explicitly carved out by the kernel command line, including adding more buffer room if it is declared that future kexec kernels will need it
- we need to decide what the minimal feature set required for the initial upstream landing would be; we brainstormed this as extending the scratch phase until the buddy allocator is up and running, adding KHO support for reserve_mem, NUMA support, and multiple scratch regions
- we need to decide on the initial set of use cases for KHO; Mike noted that reserve_mem is likely the best fit here given it is straightforward, and we should also discuss whether hugetlbfs is a likely candidate as well

Mike also noted that he is planning on sending out the next version of KHO, based upon the original work from Alex, by the end of this year.

We'll be looking to continue the discussion on this topic, as well as guestmemfs as an in-memory persistent file system, to accelerate progress and land foundational support in the kernel.
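For context on the reserve_mem candidate mentioned in the takeaways: the existing reserve_mem= kernel command-line option reserves a named memory region at boot, which consumers such as ramoops can then look up by name. KHO support would aim to keep such a region at the same location, with its contents intact, across kexec. The sizes and names below are illustrative values, not a recommendation.

```
# reserve_mem=<size>:<align>:<name> -- reserve 12 MiB, 4 KiB aligned,
# named "oops", and point ramoops at it by name; with KHO support the
# region would keep its location and contents across kexec.
reserve_mem=12M:4096:oops ramoops.mem_name=oops
```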
To continue this work, we'd like to fork off a biweekly series focused exclusively on persistent guest memory and live updates through kexec. If you are interested in participating in this series of discussions, please let me know by email. Everybody is welcome to participate, and we'll have summary email threads such as this one to follow up on the mailing lists.

Thanks!

[1] https://lore.kernel.org/all/20240117144704.602-1-graf@xxxxxxxxxx/
[2] https://lore.kernel.org/lkml/cover.1372582754.git.vdavydov@xxxxxxxxxxxxx/
[3] https://lore.kernel.org/kexec/1682554137-13938-1-git-send-email-anthony.yznaga@xxxxxxxxxx/
[4] https://lore.kernel.org/all/169645773092.11424.7258549771090599226.stgit@skinsburskii./
[5] https://lore.kernel.org/linux-mm/20231016233215.13090-1-madvenka@xxxxxxxxxxxxxxxxxxx/
[6] https://lore.kernel.org/all/20240805093245.889357-1-jgowans@xxxxxxxxxx

Also interesting, pkernfs: https://lpc.events/event/17/contributions/1485/