From: Edgecombe, Rick P <rick.p.edgecombe@xxxxxxxxx> Sent: Tuesday, November 28, 2023 10:59 AM > > On Tue, 2023-11-28 at 18:08 +0000, Michael Kelley wrote: > > > > > > Sort of separately, if those vmalloc objections can't be worked > > > through, did you consider doing something like text_poke() does > > > (create > > > the temporary mapping in a temporary MM) for pvalidate purposes? I > > > don't know enough about what kind of special exceptions might popup > > > during that operation though, might be playing with fire... > > > > Interesting idea. But from a quick glance at the text_poke() code, > > such an approach seems somewhat complex, and I suspect it will have > > the same perf issues (or worse) as creating a new vmalloc area for > > each PVALIDATE invocation. > > Using new vmalloc area's will eventually result in a kernel shootdown, > but usually have no flushes. text_poke will always result in a local- > only flush. So at least whatever slowdown there is would only affect > the calling thread. > > As for complexity, I think it might be simple to implement actually. > What kind of special exceptions could come out of pvalidate, I'm not so > sure. But the kernel terminates the VM on failure anyway, so maybe it's > not an issue? Sorry for the delay in getting back to this topic. OK, I see now what you are suggesting. For each page that needs to be PVALIDATE'd, use __text_poke() to create the temp mapping and run PVALIDATE. However, there are some problems. __text_poke() runs vmalloc_to_page() for addresses that aren't core kernel text, and vmalloc_to_page() will fail if the PTE "present" bit has been cleared. That could be easily addressed by changing it to use slow_virt_to_phys(). But PVALIDATE also needs to be able to return the PVALIDATE_FAIL_SIZEMISMATCH error code, which is tested for in pvalidate_pages() and does not terminate the VM. __text_poke() doesn’t have the machinery to return such an error code, and that's harder to fix. There's also the conceptual issue. The PVALIDATE use case isn't working with a text area, so that case would be abusing "text_poke" a bit. I could imagine __text_poke() having code to verify that it's working on a text area, even if that code isn't there now. To get a sense of performance, I hacked the equivalent of text_poke() to work with PVALIDATE. The biggest case of transitioning pages from encrypted to decrypted is the swiotlb area, which is 1 Gbyte for a VM with 16 Gbytes or more of memory. In a Hyper-V CoCo VM, current code takes about 270 milliseconds to transition that 1 Gbyte swiotlb area. With my initial approach using vmap_pages_range(), that 270 ms went to 319 ms, which is fairly negligible in the overall VM boot time. Using the text_poke() approach increased the time to 368 ms, which is bigger but still probably not a show-stopper. It's definitely faster than creating a new vmalloc area for each page that needs to be PVALIDATE'd, which adds about 6 seconds to the boot time. All-in-all, I'm back to my Plan B, which is to mark the pages "not present" only in configurations where the hypervisor callbacks operate on physical addresses instead of virtual addresses. Since SEV-SNP needs virtual addresses, it will need to handle exceptions generated by load_unaligned_zeropad() and do the appropriate fixup. I'll try to get my "Plan B" patch set posted in a new few days. Michael > > diff --git a/arch/x86/kernel/alternative.c > b/arch/x86/kernel/alternative.c > index 73be3931e4f0..a13293564eeb 100644 > --- a/arch/x86/kernel/alternative.c > +++ b/arch/x86/kernel/alternative.c > @@ -1905,6 +1905,16 @@ void *text_poke(void *addr, const void *opcode, > size_t len) > return __text_poke(text_poke_memcpy, addr, opcode, len); > } > > +static void text_poke_pvalidate(void *dst, const void *src, size_t > len) > +{ > + pvalidate(dst, len, true); // if fail, terminate > +} > + > +void *pvalidated_poke(void *addr) > +{ > + return __text_poke(text_poke_pvalidate, addr, NULL, PAGE_SIZE); > +} > + > /** > * text_poke_kgdb - Update instructions on a live kernel by kgdb > * @addr: address to modify > > > > > > > At this point, the complexity of creating the temp mapping for > > PVALIDATE is seeming excessive. On balance it seems simpler to > > revert to an approach where the use of set_memory_np() and > > set_memory_p() is conditional. It would be necessary when #VC > > and #VE exceptions are directed to a paravisor. (This assumes the > > paravisor interface in the hypervisor callbacks does the natural thing > > of working with physical addresses, so there's no need for a temp > > mapping.) > > > > Optionally, the set_memory_np()/set_memory_p() approach could > > be used in other cases where the hypervisor callbacks work with > > physical addresses. But it can't be used with cases where the > > hypervisor callbacks need valid virtual addresses. > > > > So on net, set_memory_np()/set_memory_p() would be used in > > the Hyper-V cases of TDX and SEV-SNP with a paravisor. It could > > optionally be used with TDX with no paravisor, but my sense is > > that Kirill wants to keep TDX "as is" and let the exception handlers > > do the load_unaligned_zeropad() fixup. > > > > It could not be used with SEV-SNP with no paravisor. Additional fixes > > may be needed on the SEV-SNP side to properly fixup > > load_unaligned_zeropad() accesses to a page that's in transition > > between encrypted and decrypted. > > > > Yea, I don't know about this paravisor/exception stuff.