From: kirill.shutemov@xxxxxxxxxxxxxxx <kirill.shutemov@xxxxxxxxxxxxxxx> Sent: Monday, August 28, 2023 3:14 PM
>
> On Mon, Aug 28, 2023 at 09:00:03PM +0000, Michael Kelley (LINUX) wrote:
> > From: kirill.shutemov@xxxxxxxxxxxxxxx <kirill.shutemov@xxxxxxxxxxxxxxx> Sent: Monday, August 28, 2023 9:13 AM
> > >
> > > On Mon, Aug 28, 2023 at 02:22:11PM +0000, Michael Kelley (LINUX) wrote:
> > > > From: Michael Kelley (LINUX) <mikelley@xxxxxxxxxxxxx> Sent: Tuesday, August 15, 2023 7:54 PM
> > > > >
> > > > > From: kirill.shutemov@xxxxxxxxxxxxxxx <kirill.shutemov@xxxxxxxxxxxxxxx> Sent: Sunday, August 6, 2023 3:20 PM
> > > > > >
> > > > > > On Thu, Jul 06, 2023 at 09:41:59AM -0700, Michael Kelley wrote:
> > > > > > > In a CoCo VM when a page transitions from private to shared, or vice
> > > > > > > versa, attributes in the PTE must be updated *and* the hypervisor must
> > > > > > > be notified of the change. Because there are two separate steps, there's
> > > > > > > a window where the settings are inconsistent. Normally the code that
> > > > > > > initiates the transition (via set_memory_decrypted() or
> > > > > > > set_memory_encrypted()) ensures that the memory is not being accessed
> > > > > > > during a transition, so the window of inconsistency is not a problem.
> > > > > > > However, the load_unaligned_zeropad() function can read arbitrary memory
> > > > > > > pages at arbitrary times, which could access a transitioning page during
> > > > > > > the window. In such a case, CoCo VM specific exceptions are taken
> > > > > > > (depending on the CoCo architecture in use). Current code in those
> > > > > > > exception handlers recovers and does "fixup" on the result returned by
> > > > > > > load_unaligned_zeropad(). Unfortunately, this exception handling and
> > > > > > > fixup code is tricky and somewhat fragile. At the moment, it is
> > > > > > > broken for both TDX and SEV-SNP.
> > > > >
> > > > > Thanks for looking at this.
> > > > > I'm finally catching up after being out on
> > > > > vacation for a week.
> > > > >
> > > > > > I believe it is not fixed for TDX. Is it still a problem for SEV-SNP?
> > > > >
> > > > > I presume you meant "now fixed for TDX", which I agree with. Tom
> > > > > Lendacky has indicated that there's still a problem with SEV-SNP. He
> > > > > could fix that problem, but this approach of marking the pages
> > > > > invalid obviates the need for Tom's fix.
> > > > >
> > > > > > > There's also a problem with the current code in paravisor scenarios:
> > > > > > > TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> > > > > > > to be forwarded from the paravisor to the Linux guest, but there
> > > > > > > are no architectural specs for how to do that.
> > > > >
> > > > > The TD Partitioning case (and the similar SEV-SNP vTOM mode case) is
> > > > > what doesn't have a solution. To elaborate, with TD Partitioning, #VE
> > > > > is sent to the containing VM, not the main Linux guest VM. For
> > > > > everything except an EPT violation, the containing VM can handle
> > > > > the exception on behalf of the main Linux guest by doing the
> > > > > appropriate emulation. But for an EPT violation, the containing
> > > > > VM can only terminate the guest. It doesn't have sufficient context
> > > > > to handle a "valid" #VE with EPT violation generated due to
> > > > > load_unaligned_zeropad(). My proposed scheme of marking the
> > > > > pages invalid avoids generating those #VEs and lets TD Partitioning
> > > > > (and similarly for SEV-SNP vTOM) work as intended with a paravisor.
> > > > >
> > > > > > > To avoid these complexities of the CoCo exception handlers, change
> > > > > > > the core transition code in __set_memory_enc_pgtable() to do the
> > > > > > > following:
> > > > > > >
> > > > > > > 1. Remove aliasing mappings
> > > > > > > 2. Remove the PRESENT bit from the PTEs of all transitioning pages
> > > > > > > 3. Flush the TLB globally
> > > > > > > 4. Flush the data cache if needed
> > > > > > > 5. Set/clear the encryption attribute as appropriate
> > > > > > > 6. Notify the hypervisor of the page status change
> > > > > > > 7. Add back the PRESENT bit
> > > > > >
> > > > > > Okay, looks safe.
> > > > > >
> > > > > > > With this approach, load_unaligned_zeropad() just takes its normal
> > > > > > > page-fault-based fixup path if it touches a page that is transitioning.
> > > > > > > As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> > > > > > > are completely decoupled. CoCo VM page transitions can proceed
> > > > > > > without needing to handle architecture-specific exceptions and fix
> > > > > > > things up. This decoupling reduces the complexity due to separate
> > > > > > > TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> > > > > > > introduce new capabilities in future versions of the TDX and SEV-SNP
> > > > > > > architectures. Paravisor scenarios work properly without needing
> > > > > > > to forward exceptions.
> > > > > > >
> > > > > > > This approach may make __set_memory_enc_pgtable() slightly slower
> > > > > > > because of touching the PTEs three times instead of just once. But
> > > > > > > the run time of this function is already dominated by the hypercall
> > > > > > > and the need to flush the TLB at least once and maybe twice. In any
> > > > > > > case, this function is only used for CoCo VM page transitions, and
> > > > > > > is already unsuitable for hot paths.
> > > > > > >
> > > > > > > The architecture-specific callback function for notifying the
> > > > > > > hypervisor typically must translate guest kernel virtual addresses
> > > > > > > into guest physical addresses to pass to the hypervisor. Because
> > > > > > > the PTEs are invalid at the time of callback, the code for doing the
> > > > > > > translation needs updating. virt_to_phys() or equivalent continues
> > > > > > > to work for direct map addresses.
> > > > > > > But vmalloc addresses cannot use
> > > > > > > vmalloc_to_page() because that function requires the leaf PTE to be
> > > > > > > valid. Instead, slow_virt_to_phys() must be used. Both functions
> > > > > > > manually walk the page table hierarchy, so performance is the same.
> > > > > >
> > > > > > Uhmm.. But why do we expect slow_virt_to_phys() to work on non-present
> > > > > > page table entries? It seems accidental to me that it works now. Somebody
> > > > > > forgot a pte_present() check.
> > > > >
> > > > > I didn't research the history of slow_virt_to_phys(), but I'll do so.
> > > >
> > > > The history of slow_virt_to_phys() doesn't show any discussion of
> > > > whether it is expected to work for non-present PTEs. However, the
> > > > page table walking is done by lookup_address(), which explicitly
> > > > *does* work for non-present PTEs. For example, lookup_address()
> > > > is used in cpa_flush() to find a PTE. cpa_flush() then checks to see if
> > > > the PTE is present.
> > >
> > > Which is a totally fine thing to do. The Present bit is the only info you can
> > > always rely on to be valid in a non-present PTE.
> > >
> > > But this is nitpicking on my side. Here you control the lifecycle of the PTE, so
> > > you know that the PFN will stay valid in the PTE.
> > >
> > > I guess the right thing to do is to use lookup_address() and get the pfn from
> > > the PTE, with a comment on why it is valid.
> >
> > Each of the three implementations of the enc_status_change_prepare()/
> > enc_status_change_finish() callbacks needs to translate from vaddr to
> > pfn, and so would use lookup_address(). For convenience, I would
> > create a helper function that wraps lookup_address() and returns
> > a PFN. This helper function would be very similar to slow_virt_to_phys(),
> > except for removing the shift that creates a phys_addr_t from the PFN,
> > and adding back the offset within the page. Is that the approach you
> > had in mind?
>
> I guess in this case it is better to use slow_virt_to_phys(), but have a
> comment somewhere explaining why it is safe. Maybe also add a comment inside
> slow_virt_to_phys() to indicate that we don't check the present bit for a
> reason.
>

OK, works for me. I'll turn my original RFC patch into a small patch set
that's pretty much as the RFC version is coded, along with the appropriate
comments in slow_virt_to_phys(). Additional patches in the set will remove
code that becomes unnecessary or that can be simplified.

Michael