On Thu, 2023-07-06 at 09:41 -0700, Michael Kelley wrote: > In a CoCo VM when a page transitions from private to shared, or vice > versa, attributes in the PTE must be updated *and* the hypervisor > must > be notified of the change. Because there are two separate steps, > there's > a window where the settings are inconsistent. Normally the code that > initiates the transition (via set_memory_decrypted() or > set_memory_encrypted()) ensures that the memory is not being accessed > during a transition, so the window of inconsistency is not a problem. > However, the load_unaligned_zeropad() function can read arbitrary > memory > pages at arbitrary times, which could access a transitioning page > during > the window. In such a case, CoCo VM specific exceptions are taken > (depending on the CoCo architecture in use). Current code in those > exception handlers recovers and does "fixup" on the result returned > by > load_unaligned_zeropad(). Unfortunately, this exception handling and > fixup code is tricky and somewhat fragile. At the moment, it is > broken for both TDX and SEV-SNP. > > There's also a problem with the current code in paravisor scenarios: > TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need > to be forwarded from the paravisor to the Linux guest, but there > are no architectural specs for how to do that. > > To avoid these complexities of the CoCo exception handlers, change > the core transition code in __set_memory_enc_pgtable() to do the > following: > > 1. Remove aliasing mappings > 2. Remove the PRESENT bit from the PTEs of all transitioning pages > 3. Flush the TLB globally > 4. Flush the data cache if needed > 5. Set/clear the encryption attribute as appropriate > 6. Notify the hypervisor of the page status change > 7. Add back the PRESENT bit > > With this approach, load_unaligned_zeropad() just takes its normal > page-fault-based fixup path if it touches a page that is > transitioning. > As a result, load_unaligned_zeropad() and CoCo VM page transitioning > are completely decoupled. CoCo VM page transitions can proceed > without needing to handle architecture-specific exceptions and fix > things up. This decoupling reduces the complexity due to separate > TDX and SEV-SNP fixup paths, and gives more freedom to revise and > introduce new capabilities in future versions of the TDX and SEV-SNP > architectures. Paravisor scenarios work properly without needing > to forward exceptions. > > This approach may make __set_memory_enc_pgtable() slightly slower > because of touching the PTEs three times instead of just once. But > the run time of this function is already dominated by the hypercall > and the need to flush the TLB at least once and maybe twice. In any > case, this function is only used for CoCo VM page transitions, and > is already unsuitable for hot paths. Excluding vm_unmap_aliases(), and just looking at the TLB flushes, it kind of looks like this: 1. Clear present 2. TLB flush 3. Set C bit 4. Set Present bit 5. TLB flush But if you instead did: 1. Clear Present and set C bit 2. TLB flush 3. Set Present bit (no flush) Then you could still have only 1 TLB flush and 2 operations instead of 3. Otherwise it's the same load_unaligned_zeropad() benefits you are looking for I think. But I'm not very educated on the private<->shared conversion HW rules though, so maybe not. > > The architecture specific callback function for notifying the > hypervisor typically must translate guest kernel virtual addresses > into guest physical addresses to pass to the hypervisor. Because > the PTEs are invalid at the time of callback, the code for doing the > translation needs updating. virt_to_phys() or equivalent continues > to work for direct map addresses. But vmalloc addresses cannot use > vmalloc_to_page() because that function requires the leaf PTE to be > valid. Instead, slow_virt_to_phys() must be used. Both functions > manually walk the page table hierarchy, so performance is the same. Just curious. Are vmalloc addresses supported here? It looks like in SEV, but not TDX.