Re: [RFC PATCH 0/7] support for mm-local memory allocations and use it

Fares Mehanna <faresx@xxxxxxxxx> · Wed, 25 Sep 2024 15:33:47 +0000

Hi,

Thanks for taking a look and apologies for my delayed response.

> Having a VMA in user mappings for kernel memory seems weird to say the
> least.

I see your point and agree with you. Let me explain the motivation, pros and
cons of the approach after answering your questions.

> Core MM does not expect to have VMAs for kernel memory. What will happen if
> userspace ftruncates that VMA? Or registers it with userfaultfd?

In the patch, I make sure the pages are faulted in, locked and sealed so that
the VMA is practically off-limits to the owner process. Only after that do I
change the permissions so the kernel can use it.

> This approach seems much more reasonable and it's not that it was entirely
> arch-specific. There is some plumbing at arch level, but the allocator is
> anyway arch-independent. 

So I wanted to explore a simple solution to implement mm-local kernel secret
memory without much arch dependent code. I also wanted to reuse as much of
memfd_secret() as possible to benefit from what is done already and possible
future improvements to it.

Keeping the secret pages in user virtual addresses is easier, as the page table
entries are not global by default, so no special handling is needed for
spawn(). Keeping them tracked in a VMA shouldn't require special handling for
fork() either.

The challenge was to keep the virtual addresses / VMA out of user control for
as long as the kernel is using them, and to signal to the mm core that this VMA
is special so that it is not merged with other VMAs.

I believe locking the pages, sealing the VMA and prefaulting the pages should
put it practically out of reach of user space.
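
As a rough user-space analogue of those three steps (not the in-kernel code
from the patch, which operates on a memfd_secret()-backed VMA), the sequence
could be sketched as below. The struct and function names are hypothetical,
an anonymous mapping stands in for the secret memory, and sealing is attempted
through the raw mseal() syscall only when the build headers define SYS_mseal.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical result type: records which of the steps succeeded. */
struct pinned_region {
    void *addr;   /* NULL if mmap() failed */
    size_t len;
    int locked;   /* 1 if mlock() succeeded */
    int sealed;   /* 1 if mseal() succeeded, 0 if it failed or is unavailable */
};

static inline struct pinned_region pinned_region_create(size_t len)
{
    struct pinned_region r = { NULL, len, 0, 0 };

    /* Step 1: prefault. MAP_POPULATE asks the kernel to fault the pages in
     * at mmap() time instead of on first access. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED)
        return r;
    r.addr = p;

    /* Step 2: lock. Keeps the pages resident; may fail if RLIMIT_MEMLOCK
     * is too low, which is recorded rather than treated as fatal. */
    r.locked = (mlock(p, len) == 0);

#ifdef SYS_mseal
    /* Step 3: seal. After this, munmap()/mprotect()/mremap() on the range
     * are rejected. Older kernels return ENOSYS, leaving sealed == 0. */
    r.sealed = (syscall(SYS_mseal, p, len, 0UL) == 0);
#endif
    return r;
}
```

A caller would check r.addr first and then decide whether a missing lock or
seal is acceptable. Note that a successfully sealed range can no longer be
unmapped for the lifetime of the process.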

But the current approach has these downsides (that I can think of):
1. Kernel secret user virtual addresses can still be used in functions accepting
   user virtual addresses like copy_from_user / copy_to_user.
2. Even if we are sure the VMA is off-limits to userspace, adding a VMA with
   kernel addresses will increase the attack surface between userspace and the
   kernel.
3. Since kernel secret memory is mapped in user virtual addresses, it is very
   easy to guess the exact virtual address (using binary search), and since
   this functionality is designed to keep user data, it is fair to assume the
   userspace will always be able to influence what is written there.
   So it kind of breaks KASLR for those specific pages.
4. It locks user virtual memory away, which may break software that assumes it
   can mmap() into specific places.

One way to address most of those concerns while keeping the solution almost arch
agnostic is to allocate a reasonable chunk of user virtual memory to be used
only for kernel secret memory, and not track it in VMAs.
This is similar to the old approach, but instead of creating a non-global kernel
PGD per arch it will use a chunk of user virtual memory. This chunk can be
defined per arch, and this solution won't use memfd_secret().
We can then easily enlighten the kernel about this range so the kernel can test
for this range in functions like access_ok(). This approach, however, will make
downside #4 even worse, as it will reserve a bigger chunk of user virtual memory
if this feature is enabled.
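
To illustrate the kind of range test meant here, below is a minimal user-space
model, not kernel code. The window constants, MODEL_TASK_SIZE and both function
names are hypothetical, and the real access_ok() is arch-specific and looks
different; the sketch only shows rejecting any user range that overlaps the
reserved window.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-arch window of user virtual addresses reserved for
 * kernel secret memory. The values are illustrative only. */
#define MM_LOCAL_SECRET_START UINT64_C(0x00007f0000000000)
#define MM_LOCAL_SECRET_SIZE  UINT64_C(0x0000000040000000) /* 1 GiB */

/* Hypothetical upper bound of the user address space in this model. */
#define MODEL_TASK_SIZE       UINT64_C(0x0000800000000000)

/* True if [addr, addr + size) intersects the reserved window.
 * The caller must guarantee that addr + size does not overflow. */
static inline bool range_overlaps_secret(uint64_t addr, uint64_t size)
{
    uint64_t secret_end = MM_LOCAL_SECRET_START + MM_LOCAL_SECRET_SIZE;

    if (size == 0)
        return false;
    return addr < secret_end && addr + size > MM_LOCAL_SECRET_START;
}

/* Model of an access_ok()-style check: the range must lie entirely inside
 * the user address space and must not touch the reserved window. */
static inline bool model_access_ok(uint64_t addr, uint64_t size)
{
    /* Overflow-safe form of "addr + size <= MODEL_TASK_SIZE". */
    if (size > MODEL_TASK_SIZE || addr > MODEL_TASK_SIZE - size)
        return false;
    return !range_overlaps_secret(addr, size);
}
```

With such a check, a copy_from_user()-style helper handed an address inside
the window would fail up front, which is how this variant could address
downside #1.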

I'm also very okay with switching back to the old approach, at the expense of:
1. Supporting fewer architectures, namely those that can afford to give away a
   single PGD.
2. More complicated arch specific code.

Also, @graf mentioned how aarch64 uses TTBR0/TTBR1 for user and kernel page
tables. I haven't looked at this yet, but it probably means that the kernel page
table will be tracked per process and TTBR1 will be switched during context
switching.

What do you think? I would appreciate your opinion before working on the next
RFC patch set.

Thanks!
Fares.
