On Friday 23 Feb 2024 at 16:37:23 (-0800), Elliot Berman wrote: > On Thu, Feb 22, 2024 at 11:09:40PM -0800, Christoph Hellwig wrote: > > On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote: > > > Firmware and hypervisor drivers can donate system heap memory to their > > > respective firmware/hypervisor entities. Those drivers should unmap the > > > pages from the kernel's logical map before doing so. > > > > > > Export can_set_direct_map, set_direct_map_invalid_noflush, and > > > set_direct_map_default_noflush. > > > > Err, not they should not. And not using such super low-level interfaces > > from modular code. > > Hi Cristoph, > > We've observed a few times that Linux can unintentionally access a page > we've unmapped from host's stage 2 page table via an unaligned load from > an adjacent page. The stage 2 is managed by Gunyah. There are few > scenarios where even though we allocate and own a page from buddy, > someone else could try to access the page without going through the > hypervisor driver. One such instance we know about is > load_unaligned_zeropad() via pathlookup_at() [1]. > > load_unaligned_zeropad() could be called near the end of a page. If the > next page isn't mapped by the kernel in the stage one page tables, then > the access from to the unmapped page from load_unaligned_zeropad() will > land in __do_kernel_fault(), call fixup_exception(), and fill the > remainder of the load with zeroes. If the page in question is mapped in > stage 1 but was unmapped from stage 2, then the access lands back in > Linux in do_sea(), leading to a panic(). > > Our preference would be to add fixup_exception() to S2 PTW errors for > two reasons: > 1. It's cheaper to do performance wise: we've already manipulated S2 > page table and prevent intentional access to the page because > pKVM/Gunyah drivers know that access to the page has been lost. > 2. Page-granular S1 mappings only happen on arm64 with rodata=full. > > In an off-list discussion with the Android pkvm folks, their preference > was to have the pages unmapped from stage 1. I've gone with that > approach to get started but welcome discussion on the best approach. > > The Android (downstream) implementation of arm64 pkvm is currently > implementing a hack where s2 ptw faults are given back to the host as s1 > ptw faults (i.e. __do_kernel_fault() gets called and not do_sea()) -- > allowing the kernel to fixup the exception. > > arm64 pKVM will also face this issue when implementing guest_memfd or > when donating more memory to the hyp for s2 page tables, etc. As far as > I can tell, this isn't an issue for arm64 pKVM today because memory > isn't being dynamically donated to the hypervisor. FWIW pKVM already donates memory dynamically to the hypervisor, to store e.g. guest VM metadata and page-tables, and we've never seen that problem as far as I can recall. A key difference is that pKVM injects a data abort back into the kernel in case of a stage-2 fault, so the whole EXTABLE trick/hack in load_unaligned_zeropad() should work fine out of the box. As discussed offline, Gunyah injecting an SEA into the kernel is questionable, but I understand that the architecture is a bit lacking in this department, and that's probably the next best thing. Could the Gunyah driver allocate from a CMA region instead? That would surely simplify unmapping from EL1 stage-1 (similar to how drivers usually donate memory to TZ). Thanks, Quentin