On Tue, 2020-10-20 at 15:20 +0200, David Hildenbrand wrote: > On 20.10.20 14:18, David Hildenbrand wrote: > > On 20.10.20 08:18, Kirill A. Shutemov wrote: > > > If the protected memory feature enabled, unmap guest memory from > > > kernel's direct mappings. > > > > Gah, ugly. I guess this also defeats compaction, swapping, ... oh > > gosh. > > As if all of the encrypted VM implementations didn't bring us > > enough > > ugliness already (SEV extensions also don't support reboots, but > > can at > > least kexec() IIRC). > > > > Something similar is done with secretmem [1]. And people don't seem > > to > > like fragmenting the direct mapping (including me). > > > > [1] https://lkml.kernel.org/r/20200924132904.1391-1-rppt@xxxxxxxxxx > > > > I just thought "hey, we might have to replace pud/pmd mappings by > page > tables when calling kernel_map_pages", this can fail with -ENOMEM, > why > isn't there proper error handling. > > Then I dived into __kernel_map_pages() which states: > > "The return value is ignored as the calls cannot fail. Large pages > for > identity mappings are not used at boot time and hence no memory > allocations during large page split." > > I am probably missing something important, but how is calling > kernel_map_pages() safe *after* booting?! I know we use it for > debug_pagealloc(), but using it in a production-ready feature feels > completely irresponsible. What am I missing? > My understanding was that DEBUG_PAGEALLOC maps the direct map at 4k post-boot as well so that kernel_map_pages() can't fail for it, and to avoid having to take locks for splits in interrupts. I think you are on to something though. That function is essentially a set_memory_() functions on x86 at least, and many callers do not check the error codes. Kees Cook has been pushing to fix this actually: https://github.com/KSPP/linux/issues/7 As long as a page has been broken to 4k, it should be able to be re- mapped without failure (like debug page alloc relies on). If an initial call to restrict permissions needs and fails to break a page, its remap does not need to succeed either before freeing the page, since the page never got set with a lower permission. That's my understanding from x86's cpa at least. The problem becomes that the page silently doesn't have its expected protections during use. Unchecked set_memory_() caching property change failures might result in functional problems though. So there is some background if you wanted it, but yea, I agree this feature should handle if the unmap failed. Probably return an error on the IOCTL and maybe the hypercall. kernel_map_pages() also doesn't do a shootdown though, so this direct map protection is more in the best effort category in its current state. For unmapping causing direct map fragmentation. The two techniques floating around, besides breaking indiscriminately, seem to be: 1. Group and cache unmapped pages to minimize the damage (like secret mem) 2. Somehow repair them back to large pages when reset RW (as Kirill had tried earlier this year in another series) I had imagined this usage would want something like that eventually, but neither helps with the other limitations you mentioned (migration, etc).