Hi Andrei Thanks for your email. I was hoping to get some feedback from CRIU devs, and happy to see you reaching out.. On Mon, Dec 9, 2024 at 8:12 PM Andrei Vagin <avagin@xxxxxxxxx> wrote: > > On Mon, Nov 25, 2024 at 12:49 PM <jeffxu@xxxxxxxxxxxx> wrote: > > > > From: Jeff Xu <jeffxu@xxxxxxxxxxxx> > > > > Seal vdso, vvar, sigpage, uprobes and vsyscall. > > > > Those mappings are readonly or executable only, sealing can protect > > them from ever changing or unmapped during the life time of the process. > > For complete descriptions of memory sealing, please see mseal.rst [1]. > > > > System mappings such as vdso, vvar, and sigpage (for arm) are > > generated by the kernel during program initialization, and are > > sealed after creation. > > > > Unlike the aforementioned mappings, the uprobe mapping is not > > established during program startup. However, its lifetime is the same > > as the process's lifetime [2]. It is sealed from creation. > > > > The vdso, vvar, sigpage, and uprobe mappings all invoke the > > _install_special_mapping() function. As no other mappings utilize this > > function, it is logical to incorporate sealing logic within > > _install_special_mapping(). This approach avoids the necessity of > > modifying code across various architecture-specific implementations. > > > > The vsyscall mapping, which has its own initialization function, is > > sealed in the XONLY case, it seems to be the most common and secure > > case of using vsyscall. > > > > It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may > > alter the mapping of vdso, vvar, and sigpage during restore > > operations. Consequently, this feature cannot be universally enabled > > across all systems. > > > ... > > > > +config SEAL_SYSTEM_MAPPINGS > > + bool "seal system mappings" > > + default n > > + depends on 64BIT > > + depends on ARCH_HAS_SEAL_SYSTEM_MAPPINGS > > + depends on !CHECKPOINT_RESTORE > > Hi Jeff, > > I like the idea of this patchset, but I don’t like the idea of > forcing users to choose between this security feature and > checkpoint/restore functionality. We need to explore ways to make this > feature work with checkpoint/restore. Relying on CAP_CHECKPOINT_RESTORE > is the obvious approach. > I agree that forcing users to choose isn't ideal. I'd prefer a solution where both approaches can be used in some way, depending on the situation and distributions. Hopefully, with input from CRIU developers, this can be achieved. However, it makes sense to unconditionally seal vdso/vvar for systems like ChromeOS and Android that don't currently use CRIU, so we will need a KCONFIG for that. > CRIU just needs to move these mappings, and it doesn't need to change > their properties or modify their contents. With that in mind, here are That is an important detail to know, thanks for bringing it up. > two options: > * Allow moving sealed mappings for processes with CAP_CHECKPOINT_RESTORE. We could try to propose this under a new KCONFIG, e.g. CONFIG_SEAL_SYSTEM_MAPPING_WITH_CAP_CHECK IIUC, You propose allowing userspace mremap vdso if the process has CAP_CHECKPOINT_RESTORE. However, I believe this approach raises security concerns. During the RFC for mseal, initially, I suggested sealing mmap, mremap, and munmap individually, but Linus rejected this proposal, for a good reason, mremap could leave an empty hole in the address space, thus allowing the attacker to fill it with attacker controlled content. Furthermore, CAP_SYSTEM_ADMIN allows setting any capacity, so this would become a by-pass for sealing. > * Allow temporarily "unsealing" mappings for processes with > CAP_CHECKPOINT_RESTORE. CRIU could unseal mappings, move them, and > then seal them back. > We could also try to propose this under CONFIG_SEAL_SYSTEM_MAPPING_FOR_CRIU. It's important to note that temporarily unsealing a mapping from userspace is not permitted. If a mapping has the capability to be unsealed, it fundamentally does not provide the sealing property. Perhaps the intention was for these steps to be carried out within the kernel? e.g. the userspace could instruct the kernel to relocate the vdso mapping. Since the kernel can ensure the vdso contents are not manipulated by an attacker, this approach could offer a viable solution. I have been thinking of other alternatives, but those would require more understanding on CRIU use cases. One of my questions is: Would CRIU target an individual process? or entire systems? If it is an individual process, we could use prctl to opt-in/opt-out certain processes. There could be two alternatives. 1> Opt-in solution: process must set prctl.seal_criu_mapping, this needs to be set before execve() because sealing is applied at execve() call. 2> opt-out solution: The system will by default seal all of the system mappings, but individual processes can opt-out by setting prctl.not_seal_criu_mappings. This also needs to be set before execve() call. For both cases, we will want to identify what type of mapping CRIU cares about, i.e. maybe CRIU doesn't care about uprobe and vsyscall ? and only care about vdso/vvar/sigpage ? > Another approach might be to make this feature configurable on a > per-process basis (e.g., via prctl). Once enabled for a process, it > would be inherited by all its children. It can't be disabled unless a > process has CAP_CHECKPOINT_RESTORE. > > I've added Mike, Dima, and Alex to the thread. They might have > other ideas. > Thanks. Please feel free to chime in, I will also add Mike,Dima and Alex to the new version of this series as well. Thanks! -Jeff > Thanks, > Andrei