> On 11. Oct 2024, at 14:36, Mediouni, Mohamed <mediou@xxxxxxxxx> wrote:
>
>
>
>> On 11. Oct 2024, at 14:04, David Hildenbrand <david@xxxxxxxxxx> wrote:
>>
>> On 10.10.24 17:52, Fares Mehanna wrote:
>>>>> In a series posted a few years ago [1], a proposal was put forward to allow the
>>>>> kernel to allocate memory local to a mm and thus push it out of reach for
>>>>> current and future speculation-based cross-process attacks. We still believe
>>>>> this is a nice thing to have.
>>>>>
>>>>> However, in the time passed since that post Linux mm has grown quite a few new
>>>>> goodies, so we'd like to explore possibilities to implement this functionality
>>>>> with less effort and churn leveraging the now available facilities.
>>>>>
>>>>> An RFC was posted few months back [2] to show the proof of concept and a simple
>>>>> test driver.
>>>>>
>>>>> In this RFC, we're using the same approach of implementing mm-local allocations
>>>>> piggy-backing on memfd_secret(), using regular user addresses but pinning the
>>>>> pages and flipping the user/supervisor flag on the respective PTEs to make them
>>>>> directly accessible from kernel.
>>>>> In addition to that we are submitting 5 patches to use the secret memory to hide
>>>>> the vCPU gp-regs and fp-regs on arm64 VHE systems.
>>>>
>>>> I'm a bit lost on what exactly we want to achieve. The point where we
>>>> start flipping user/supervisor flags confuses me :)
>>>>
>>>> With secretmem, you'd get memory allocated that
>>>> (a) Is accessible by user space -- mapped into user space.
>>>> (b) Is inaccessible by kernel space -- not mapped into the direct map
>>>> (c) GUP will fail, but copy_from / copy_to user will work.
>>>>
>>>>
>>>> Another way, without secretmem, would be to consider these "secrets"
>>>> kernel allocations that can be mapped into user space using mmap() of a
>>>> special fd. That is, they wouldn't have their origin in secretmem, but
>>>> in KVM as a kernel allocation.
>>>> It could be achieved by using VM_MIXEDMAP
>>>> with vm_insert_pages(), manually removing them from the directmap.
>>>>
>>>> But, I am not sure who is supposed to access what. Let's explore the
>>>> requirements. I assume we want:
>>>>
>>>> (a) Pages accessible by user space -- mapped into user space.
>>>> (b) Pages inaccessible by kernel space -- not mapped into the direct map
>>>> (c) GUP to fail (no direct map).
>>>> (d) copy_from / copy_to user to fail?
>>>>
>>>> And on top of that, some way to access these pages on demand from kernel
>>>> space? (temporary CPU-local mapping?)
>>>>
>>>> Or how would the kernel make use of these allocations?
>>>>
>>>> --
>>>> Cheers,
>>>>
>>>> David / dhildenb
>>> Hi David,
>>
>> Hi Fares!
>>
>>> Thanks for taking a look at the patches!
>>> We're trying to allocate a kernel memory that is accessible to the kernel but
>>> only when the context of the process is loaded.
>>> So this is a kernel memory that is not needed to operate the kernel itself, it
>>> is to store & process data on behalf of a process. The requirement for this
>>> memory is that it would never be touched unless the process is scheduled on this
>>> core. otherwise any other access will crash the kernel.
>>> So this memory should only be directly readable and writable by the kernel, but
>>> only when the process context is loaded. The memory shouldn't be readable or
>>> writable by the owner process at all.
>>> This is basically done by removing those pages from kernel linear address and
>>> attaching them only in the process mm_struct. So during context switching the
>>> kernel loses access to the secret memory scheduled out and gain access to the
>>> new process secret memory.
>>> This generally protects against speculation attacks, and if other process managed
>>> to trick the kernel to leak data from memory. In this case the kernel will crash
>>> if it tries to access other processes secret memory.
>>> Since this memory is special in the sense that it is kernel memory but only make
>>> sense in the term of the owner process, I tried in this patch series to explore
>>> the possibility of reusing memfd_secret() to allocate this memory in user virtual
>>> address space, manage it in a VMA, flipping the permissions while keeping the
>>> control of the mapping exclusively with the kernel.
>>> Right now it is:
>>> (a) Pages not accessible by user space -- even though they are mapped into user
>>>     space, the PTEs are marked for kernel usage.
>>
>> Ah, that is the detail I was missing, now I see what you are trying to achieve, thanks!
>>
>> It is a bit architecture specific, because ... imagine architectures that have separate kernel+user space page table hierarchies, and not a simple PTE flag to change access permissions between kernel/user space.
>>
>> IIRC s390 is one such architecture that uses separate page tables for the user-space + kernel-space portions.
>>
>>> (b) Pages accessible by kernel space -- even though they are not mapped into the
>>>     direct map, the PTEs in uvaddr are marked for kernel usage.
>>> (c) copy_from / copy_to user won't fail -- because it is in the user range, but
>>>     this can be fixed by allocating specific range in user vaddr to this feature
>>>     and check against this range there.
>>> (d) The secret memory vaddr is guessable by the owner process -- that can also
>>>     be fixed by allocating bigger chunk of user vaddr for this feature and
>>>     randomly placing the secret memory there.
>>> (e) Mapping is off-limits to the owner process by marking the VMA as locked,
>>>     sealed and special.
>>
>> Okay, so in this RFC you are jumping through quite some hoops to have a kernel allocation unmapped from the direct map but mapped into a per-process page table only accessible by kernel space. :)
>>
>> So you really don't want this mapped into user space at all (consequently, no GUP, no access, no copy_from_user ...).
>> In this RFC it's mapped but turned inaccessible by flipping the "kernel vs. user" switch.
>>
>>> Other alternative (that was implemented in the first submission) is to track those
>>> allocations in a non-shared kernel PGD per process, then handle creating, forking
>>> and context-switching this PGD.
>>
>> That sounds like a better approach. So we would remove the pages from the shared kernel direct map and map them into a separate kernel-portion in the per-MM page tables?
>>
>> Can you envision that would also work with architectures like s390x? I assume we would not only need the per-MM user space page table hierarchy, but also a per-MM kernel space page table hierarchy, into which we also map the common/shared-among-all-processes kernel space page tables (e.g., directmap).
> Yes, that’s also applicable to arm64. There’s currently no separate per-mm user space page hierarchy there.

typo, read kernel

Thanks,
-Mohamed

>>> What I like about the memfd_secret() approach is the simplicity and being arch
>>> agnostic, what I don't like is the increased attack surface by using VMAs to
>>> track those allocations.
>>
>> Yes, but memfd_secret() was really design for user space to hold secrets. But I can see how you came to this solution.
>>
>>> I'm thinking of working on a PoC to implement the first approach of using a
>>> non-shared kernel PGD for secret memory allocations on arm64. This includes
>>> adding kernel page table per process where all PGDs are shared but one which
>>> will be used for secret allocations mapping. And handle the fork & context
>>> switching (TTBR1 switching(?)) correctly for the secret memory PGD.
>>> What do you think? I'd really appreciate opinions and possible ways forward.
>>
>> Naive question: does arm64 rather resemble the s390x model or the x86-64 model?
> arm64 has separate page tables for kernel and user-mode. Except for the KPTI case, the kernel page tables aren’t swapped per-process and stay the same all the time.
>
> Thanks,
> -Mohamed
>> --
>> Cheers,
>>
>> David / dhildenb
>>
>

Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Managing Directors: Christian Schlaeger, Jonathan Weiss
Registered at the Charlottenburg Local Court (Amtsgericht Charlottenburg) under HRB 257764 B
Registered office: Berlin
VAT ID: DE 365 538 597
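
For readers following the thread, the "all PGDs are shared but one" idea quoted above can be pictured with a small user-space toy model. This is only a sketch under stated assumptions: every name here (toy_mm, toy_switch_mm, toy_read_secret, SECRET_SLOT, PGD_ENTRIES) is hypothetical and none of it is kernel code or taken from the RFC. It models a single CPU, one private top-level slot per mm, and a "fault" as a false return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: all names below are hypothetical, not kernel APIs. */
#define PGD_ENTRIES 8
#define SECRET_SLOT 7   /* the one top-level entry that is private per mm */

struct pgd_entry {
    const void *table;  /* what this top-level entry points at */
};

struct toy_mm {
    struct pgd_entry kernel_pgd[PGD_ENTRIES];
    int secret;         /* stands in for a page of mm-local secret data */
};

/* Stand-ins for the kernel mappings that every mm shares. */
static int shared_tables[PGD_ENTRIES];

/* The mm whose kernel page table is installed on the single modeled CPU. */
static const struct toy_mm *current_mm;

/* Build a per-mm kernel PGD: shared entries everywhere except SECRET_SLOT. */
static void toy_mm_init(struct toy_mm *mm, int secret)
{
    for (size_t i = 0; i < PGD_ENTRIES; i++)
        mm->kernel_pgd[i].table = &shared_tables[i];
    mm->secret = secret;
    mm->kernel_pgd[SECRET_SLOT].table = &mm->secret;
}

/* Models the context switch that installs the next mm's kernel PGD. */
static void toy_switch_mm(const struct toy_mm *next)
{
    current_mm = next;
}

/*
 * Models a kernel access to owner's secret memory. It succeeds only if the
 * installed PGD maps that memory; returning false stands in for a fault.
 */
static bool toy_read_secret(const struct toy_mm *owner, int *out)
{
    if (!current_mm ||
        current_mm->kernel_pgd[SECRET_SLOT].table != &owner->secret)
        return false;
    *out = owner->secret;
    return true;
}
```

In this model, after toy_switch_mm(&a) a read of a's secret succeeds while a read of b's secret fails, which mirrors the stated requirement that the memory is reachable only while the owning process context is loaded. How a real implementation would handle fork, TLB maintenance, or TTBR1 switching is not covered here.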