Re: [RFC PATCH 0/7] support for mm-local memory allocations and use it

Mike Rapoport <rppt@xxxxxxxxxx> · Fri, 27 Sep 2024 10:08:01 +0300

On Wed, Sep 25, 2024 at 03:33:47PM +0000, Fares Mehanna wrote:
> Hi,
> 
> Thanks for taking a look and apologies for my delayed response.
> 
> > Having a VMA in user mappings for kernel memory seems weird to say the
> > least.
> 
> I see your point and agree with you. Let me explain the motivation, pros and
> cons of the approach after answering your questions.
> 
> > Core MM does not expect to have VMAs for kernel memory. What will happen if
> > userspace ftruncates that VMA? Or registers it with userfaultfd?
> 
> In the patch, I make sure the pages are faulted in, locked and sealed to make
> sure the VMA is practically off-limits from the owner process. Only after that
> I change the permissions to be used by the kernel.

And what about VMA accesses from the kernel? How do you verify that
everything that works with VMAs in the kernel can deal with that being a
kernel mapping rather than userspace?

> > This approach seems much more reasonable and it's not that it was entirely
> > arch-specific. There is some plumbing at arch level, but the allocator is
> > anyway arch-independent. 
> 
> So I wanted to explore a simple solution to implement mm-local kernel secret
> memory without much arch dependent code. I also wanted to reuse as much of
> memfd_secret() as possible to benefit from what is done already and possible
> future improvements to it.

Adding functionality that normally belongs to userspace into mm/secretmem.c
does not feel like a reuse, sorry.

The only thing your actually share is removal of the allocated pages from
the direct map. And hijacking userspace mapping instead of properly
implementing a kernel mapping does not seem like proper solution.

> Keeping the secret pages in user virtual addresses is easier as the page table
> entries are not global by default so no special handling for spawn(). keeping
> them tracked in VMA shouldn't require special handling for fork().
> 
> The challenge was to keep the virtual addresses / VMA away from user control as
> long as the kernel is using it, and signal the mm core that this VMA is special
> so it is not merged with other VMAs.
> 
> I believe locking the pages, sealing the VMA, prefaulting the pages should make
> it practicality away of user space influence.
> 
> But the current approach have those downsides: (That I can think of)
> 1. Kernel secret user virtual addresses can still be used in functions accepting
>    user virtual addresses like copy_from_user / copy_to_user.
> 2. Even if we are sure the VMA is off-limits to userspace, adding VMA with
>    kernel addresses will increase attack surface between userspace and the
>    kernel.
> 3. Since kernel secret memory is mapped in user virtual addresses, it is very
>    easy to guess the exact virtual address (using binary search), and since
>    this functionality is designed to keep user data, it is fair to assume the
>    userspace will always be able to influence what is written there.
>    So it kind of breaks KASLR for those specific pages.

There is even no need to guess, it will appear on /proc/pid/maps

> 4. It locks user virtual memory away, this may break some software if they
>    assumed they can mmap() into specific places.
> 
> One way to address most of those concerns while keeping the solution almost arch
> agnostic is is to allocate reasonable chunk of user virtual memory to be only
> used for kernel secret memory, and not track them in VMAs.
> This is similar to the old approach but instead of creating non-global kernel
> PGD per arch it will use chunk of user virtual memory. This chunk can be defined
> per arch, and this solution won't use memfd_secret().
> We can then easily enlighten the kernel about this range so the kernel can test
> for this range in functions like access_ok(). This approach however will make
> downside #4 even worse, as it will reserve bigger chunk of user virtual memory
> if this feature is enabled.
> 
> I'm also very okay switching back to the old approach with the expense of:
> 1. Supporting fewer architectures that can afford to give away single PGD.

Only few architectures can modify their direct map, and all these can spare
a PGD entry.

> 2. More complicated arch specific code.

On x86 similar code already exists for LDT, you may want to look at Andy's
comments on old proclocal posting:

https://lore.kernel.org/lkml/CALCETrXHbS9VXfZ80kOjiTrreM2EbapYeGp68mvJPbosUtorYA@xxxxxxxxxxxxxx/

> Also @graf mentioned how aarch64 uses TTBR0/TTBR1 for user and kernel page
> tables, I haven't looked at this yet but it probably means that kernel page
> table will be tracked per process and TTBR1 will be switched during context
> switching.
> 
> What do you think? I would appreciate your opinion before working on the next
> RFC patch set.
> 
> Thanks!
> Fares.
> 
> 
> 
> Amazon Web Services Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
> Sitz: Berlin
> Ust-ID: DE 365 538 597
> 

-- 
Sincerely yours,
Mike.