On 2/2/21 10:02 AM, Kirill A. Shutemov wrote:
> On Mon, Feb 01, 2021 at 05:51:09PM -0800, David Rientjes wrote:
>> Hi everybody,
>>
>> I'd like to kick-start the discussion on lazy validation of guest memory
>> for the purposes of AMD SEV-SNP and Intel TDX.
>>
>> Both AMD SEV-SNP and Intel TDX require validation of guest memory before
>> it may be used by the guest. This is needed for integrity protection from
>> a potentially malicious hypervisor or other host components.
>>
>> For AMD SEV-SNP, the hypervisor assigns a page to the guest using the new
>> RMPUPDATE instruction. The guest then transitions the page to a usable by
>> the new PVALIDATE instruction[1]. This sets the Validated flag in the
>> Reverse Map Table (RMP) for a guest addressable page, which opts into
>> hardware and firmware integrity protection. This may only be done by the
>> guest itself and until that time, the guest cannot access the page.
>>
>> The guest can only PVALIDATE memory for a gPA once; the RMP then
>> guarantees for each hPA that there is only a single gPA mapping. This
>> validation can either be done all up front at the time the guest is booted
>> or it can be done lazily at runtime on fault if the guest keeps track of
>> Valid vs Invalid pages. Because doing PVALIDATE for all guest memory at
>> boot would be extremely lengthy, I'd like to discuss the options for doing
>> it lazily.
>>
>> Similarly, for Intel TDX, the hypervisor unmaps the gPA from the shared
>> EPT and invalidates the tlb and all caches for the TD's vcpus; it then
>> adds a page to the gPA address space for a TD by using the new
>> TDH.MEM.PAGE.AUG call. The TDG.MEM.PAGE.ACCEPT TDCALL[2] then allows a
>> guest to accept a guest page for a gPA and initialize it using the private
>> key for that TD. This may only be done by the TD itself and until that
>> time, the gPA cannot be used within the TD.
>>
>> Both AMD SEV-SNP and Intel TDX support hugepages.
>> SEV-SNP supports 2MB whereas TDX has accept TDCALL support for 2MB and
>> 1GB.
>>
>> I believe the UEFI ECR[3] for the unaccepted memory type to
>> EFI_MEMORY_TYPE was accepted in December. This should enable the guest to
>> learn what memory has not yet been validated (or accepted) by the firmware
>> if all guest memory is not done completely up front.
>>
>> This likely requires a pre-validation of all memory that can be accessed
>> when handling a #VC (or #VE for TDX) such as IST stacks, including memory
>> in the x86 boot sequence that must be validated before the core mm
>> subsystem is up and running to handle the lazy validation. I believe
>> lazy validation can be done by the core mm after that, perhaps by
>> maintaining a new "validated" bit in struct page flags.
>>
>> Has anybody looked into this or, even better, is anybody currently working
>> on this?
> It's likely I'm going to do this on Intel side, but I have not looked
> deeply into it.
>
>> I think quite invasive changes are needed for the guest to support lazy
>> validation/acceptance to core areas that lots of people on the recipient
>> list have strong opinions about. Some things that come to mind:
>>
>> - Annotations for pages that must be pre-validated in the x86 boot
>>   sequence, including IST stacks
>>
>> - Proliferation of these annotations throughout any kernel code that can
>>   access memory for #VC or #VE
>>
>> - Handling lazy validation of guest memory through the core mm layer,
>>   most likely involving a bit in struct page flags to track their status
>>
>> - Any need for validating memory that is not backed by struct page that
>>   needs to be special-cased
>>
>> - Any concerns about this for the DMA layer
>>
>> One possibility for minimal disruption to the boot entry code is to
>> require the guest BIOS to validate 4GB and below, and then leave 4GB and
>> above to be done lazily (the true amount of memory will actually be less
>> due to the MMIO hole).
> [ As I didn't looked into actual code, I may say total garbage below... ]
>
> Pre-validating 4GB would indeed be easiest way to go, but it's going to be
> too slow.
>
> The more realistic is for BIOS to pre-validate memory where kernel and
> initrd are placed, plus few dozen megs for runtime. It means decompression
> code would need to be aware about the validation.

I was thinking that having the BIOS validate the lower 4GB would simplify
the changes to the kernel entry code path, as well as provide a clean
approach to supporting kexec. My initial thought is:

- BIOS or VMM validates the lower 4GB of memory.
- BIOS marks the memory above 4GB as unaccepted in the e820/EFI memmap.
- Kernel early boot can be achieved with minimal (or no) changes.
- If an unaccepted type is discovered, allocate a bitmap that can be used
  to keep track of information (e.g. which pages are validated). We can
  also explore whether removing the unaccepted flag from the memmap range
  will work.
- On #VC/#VE, look at the bitmap to see if we need to validate the pages.
  To speed things up, we can validate more than one page per #VC/#VE.
- If we get kexec'd, rebuild the e820/memmap based on the bitmap so that
  we don't double validate.

>
> The critical thing is that once memory is validate we must not validate
> it again. It's possible VMM->guest attack vector. We must track precisely
> what memory has been validated and stop the guest on detecting the
> unexpected second validation request.
>
> It also means that we has to keep the information when the control gets
> passed from decompression code to the real kernel. Page flag is no good
> for this.
>
> My initial thought that we can use e820/efi memmap to keep track of
> information -- remove the unaccepted memory flag from the range that got
> accepted.
>
> The decompression code validates the memory that it's need for
> decompression, modify memmap accordingly and pass control to the main
> kernel.
>
> The main kernel may accept the memory via #VE/#VC, but ideally it need to
> stay within memory accepted by decompression code for initial boot.
>
> I think the bulk of memory validation can be done via existing machinery:
> we have already deferred struct page initialization code in kernel and I
> believe we can hook up into it for the purpose.
>
> Any comments?
>
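
To make the bitmap idea above a bit more concrete, here is a rough user-space sketch, not kernel code. Everything in it is hypothetical and only for illustration: the names (validate_page, handle_unvalidated_fault, report_unaccepted_ranges), the region size NR_PAGES and the batch size BATCH are all made up. validate_page() merely stands in for PVALIDATE or TDG.MEM.PAGE.ACCEPT and issues neither. It shows the three pieces discussed: refusing a second validation, validating more than one page per fault, and walking the bitmap to find the ranges that would still have to be reported as unaccepted after kexec.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical sizes, chosen only to keep the example small. */
#define NR_PAGES      64UL  /* 4K pages in the lazily validated region */
#define BATCH         8UL   /* pages validated per #VC/#VE */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* One bit per page: set once the page has been validated/accepted. */
static unsigned long validated[(NR_PAGES + BITS_PER_WORD - 1) / BITS_PER_WORD];

static bool page_is_validated(unsigned long pfn)
{
    assert(pfn < NR_PAGES);
    return validated[pfn / BITS_PER_WORD] & (1UL << (pfn % BITS_PER_WORD));
}

/*
 * Stand-in for PVALIDATE / TDG.MEM.PAGE.ACCEPT (assumption: the real
 * instruction would be issued here). Returns false if the page was
 * already validated, which a real guest would have to treat as fatal.
 */
static bool validate_page(unsigned long pfn)
{
    if (page_is_validated(pfn))
        return false;
    validated[pfn / BITS_PER_WORD] |= 1UL << (pfn % BITS_PER_WORD);
    return true;
}

/*
 * Fault path: validate the BATCH-aligned block containing pfn, skipping
 * pages that are already done. Returns the number of newly validated pages.
 */
static unsigned long handle_unvalidated_fault(unsigned long pfn)
{
    unsigned long start = pfn - (pfn % BATCH);
    unsigned long end = start + BATCH < NR_PAGES ? start + BATCH : NR_PAGES;
    unsigned long done = 0;

    for (unsigned long p = start; p < end; p++)
        if (!page_is_validated(p) && validate_page(p))
            done++;
    return done;
}

/*
 * kexec path: print each run of pages that is still unvalidated, i.e. what
 * a rebuilt memmap would need to mark as unaccepted. Returns the run count.
 */
static unsigned long report_unaccepted_ranges(void)
{
    unsigned long ranges = 0, p = 0;

    while (p < NR_PAGES) {
        if (page_is_validated(p)) {
            p++;
            continue;
        }
        unsigned long start = p;
        while (p < NR_PAGES && !page_is_validated(p))
            p++;
        printf("unaccepted: pfn %lu-%lu\n", start, p - 1);
        ranges++;
    }
    return ranges;
}
```

With these sizes, a fault on pfn 10 validates pfns 8 through 15 in one go, and a later fault inside the same block has nothing left to do. Whether a flat bitmap like this scales to large guests, or whether trimming the memmap ranges is the better record, is still the open question from above.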