Re: Re: [PATCH v17 19/35] arch/mm: Export direct {un,}map functions

Elliot Berman <quic_eberman@xxxxxxxxxxx> · Fri, 23 Feb 2024 16:37:23 -0800

On Thu, Feb 22, 2024 at 11:09:40PM -0800, Christoph Hellwig wrote:
> On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote:
> > Firmware and hypervisor drivers can donate system heap memory to their
> > respective firmware/hypervisor entities. Those drivers should unmap the
> > pages from the kernel's logical map before doing so.
> > 
> > Export can_set_direct_map, set_direct_map_invalid_noflush, and
> > set_direct_map_default_noflush.
> 
> Err, not they should not.  And not using such super low-level interfaces
> from modular code.

Hi Cristoph,

We've observed a few times that Linux can unintentionally access a page
we've unmapped from host's stage 2 page table via an unaligned load from
an adjacent page. The stage 2 is managed by Gunyah. There are few
scenarios where even though we allocate and own a page from buddy,
someone else could try to access the page without going through the
hypervisor driver. One such instance we know about is
load_unaligned_zeropad() via pathlookup_at() [1].

load_unaligned_zeropad() could be called near the end of a page. If the
next page isn't mapped by the kernel in the stage one page tables, then
the access from to the unmapped page from load_unaligned_zeropad() will
land in __do_kernel_fault(), call fixup_exception(), and fill the
remainder of the load with zeroes. If the page in question is mapped in
stage 1 but was unmapped from stage 2, then the access lands back in
Linux in do_sea(), leading to a panic().

Our preference would be to add fixup_exception() to S2 PTW errors for
two reasons:
1. It's cheaper to do performance wise: we've already manipulated S2
   page table and prevent intentional access to the page because
   pKVM/Gunyah drivers know that access to the page has been lost.
2. Page-granular S1 mappings only happen on arm64 with rodata=full.

In an off-list discussion with the Android pkvm folks, their preference
was to have the pages unmapped from stage 1. I've gone with that
approach to get started but welcome discussion on the best approach.

The Android (downstream) implementation of arm64 pkvm is currently
implementing a hack where s2 ptw faults are given back to the host as s1
ptw faults (i.e. __do_kernel_fault() gets called and not do_sea()) --
allowing the kernel to fixup the exception.

arm64 pKVM will also face this issue when implementing guest_memfd or
when donating more memory to the hyp for s2 page tables, etc. As far as
I can tell, this isn't an issue for arm64 pKVM today because memory
isn't being dynamically donated to the hypervisor.

Thanks,
Elliot

[1]:
path_lookupat+0x340/0x3228
filename_lookup+0xbc/0x1c0
__arm64_sys_newfstatat+0xb0/0x4a0
invoke_syscall+0x58/0x118