On 3/29/2022 2:30 AM, Sean Christopherson wrote:
> On Tue, Mar 08, 2022, Nikunj A Dadhania wrote:
>> This is a follow-up to the RFC implementation [1] that incorporates
>> review feedback and bug fixes. See the "RFC v1" section below for a
>> list of changes.
>
> Heh, for future reference, the initial posting of a series/patch/RFC is implicitly
> v1, i.e. this should be RFC v2.

Sure.

>> SEV guests require the guest's pages to be pinned in host physical
>> memory, as migration of encrypted pages is not supported. The memory
>> encryption scheme uses the physical address of the memory being
>> encrypted. If guest pages are moved by the host, content decrypted in
>> the guest would be incorrect, thereby corrupting the guest's memory.
>>
>> For SEV/SEV-ES guests, the hypervisor doesn't know which pages are
>> encrypted and when the guest is done using those pages. The hypervisor
>> should treat all the guest pages as encrypted until they are
>> deallocated or the guest is destroyed.
>>
>> While provisioning a pfn, make KVM aware that guest pages need to be
>> pinned long-term and use the appropriate pin_user_pages API for these
>> special encrypted memory regions. KVM takes the first reference and
>> holds it until a mapping is done. Take an extra reference before KVM
>> releases the pfn.
>>
>> Actual pinning management is handled by vendor code via new
>> kvm_x86_ops hooks. The MMU calls into vendor code to pin the page on
>> demand. Metadata of the pinning is stored in the architecture-specific
>> memslot area. During the memslot freeing path and the deallocation path,
>> guest pages are unpinned.
>>
>> Guest boot time comparison:
>> +---------------+----------------+-------------------+
>> | Guest Memory  | baseline       | Demand Pinning +  |
>> | Size (GB)     | v5.17-rc6(secs)| v5.17-rc6(secs)   |
>> +---------------+----------------+-------------------+
>> |      4        |      6.16      |       5.71        |
>> +---------------+----------------+-------------------+
>> |     16        |      7.38      |       5.91        |
>> +---------------+----------------+-------------------+
>> |     64        |     12.17      |       6.16        |
>> +---------------+----------------+-------------------+
>> |    128        |     18.20      |       6.50        |
>> +---------------+----------------+-------------------+
>> |    192        |     24.56      |       6.80        |
>> +---------------+----------------+-------------------+
>
> Let me preface this by saying I generally like the idea and especially the
> performance, but...
>
> I think we should abandon this approach in favor of committing all our resources
> to fd-based private memory[*], which (if done right) will provide on-demand pinning
> for "free".

I will give this a try for SEV; it was on my todo list.

> I would much rather get that support merged sooner than later, and use
> it as a carrot for legacy SEV to get users to move over to its new APIs, with a long
> term goal of deprecating and disallowing SEV/SEV-ES guests without fd-based private
> memory.
>
> That would require guest kernel support to communicate private vs. shared,

Could you explain this in more detail? Is this required for punching holes for shared pages?

> but SEV guests already "need" to do that to play nice with live migration, so it's
> not a big ask, just another carrot to entice guests/customers to update their kernel
> (and possibly users to update their guest firmware).
>
> This series isn't awful by any means, but it requires poking into core flows and
> further complicates paths that are already anything but simple.  And things like
> conditionally grabbing vCPU0 to pin pages in its MMU make me flinch.  And I think
> the situation would only get worse by the time all the bugs and corner cases are
> ironed out.  E.g. this code is wrong:
>
>	void kvm_release_pfn_clean(kvm_pfn_t pfn)
>	{
>		if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) {
>			struct page *page = pfn_to_page(pfn);
>
>			if (page_maybe_dma_pinned(page))
>				unpin_user_page(page);
>			else
>				put_page(page);
>		}
>	}
>	EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
>
> Because (a) page_maybe_dma_pinned() is susceptible to false positives (clearly
> documented), and (b) even if it didn't get false positives, there's no guarantee
> that _KVM_ owns a pin of the page.

Right, the pinning could have been done by some other subsystem.
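
To make that concrete, below is a rough, untested sketch of the direction a
correct release path would probably have to take: consult KVM's own pin
bookkeeping instead of guessing with page_maybe_dma_pinned(), so that KVM only
drops pins it took itself. Both kvm_release_pfn_clean_slot() and
kvm_memslot_page_is_pinned() are hypothetical names; the latter stands in for a
lookup into the per-memslot pin metadata this series keeps.

#include <linux/kvm_host.h>
#include <linux/mm.h>

/*
 * Rough sketch only, not a real patch.  kvm_memslot_page_is_pinned() is a
 * hypothetical helper that would consult KVM's own per-memslot pin metadata,
 * i.e. the record of whether *KVM* pinned this gfn, rather than inferring
 * ownership from page_maybe_dma_pinned().
 */
void kvm_release_pfn_clean_slot(struct kvm_memory_slot *slot, gfn_t gfn,
				kvm_pfn_t pfn)
{
	struct page *page;

	if (is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn))
		return;

	page = pfn_to_page(pfn);

	/* Only unpin pages that KVM's own bookkeeping says KVM pinned. */
	if (slot && kvm_memslot_page_is_pinned(slot, gfn))
		unpin_user_page(page);
	else
		put_page(page);
}

Even then, every caller releasing a pfn would need the slot/gfn in hand, which
is exactly the auditing/plumbing problem you describe below.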
> It's not an impossible problem to solve, but I suspect any solution will require
> either touching a lot of code or will be fragile and difficult to maintain, e.g.
> by auditing all users to understand which need to pin and which don't.  Even if
> we _always_ pin memory for SEV guests, we'd still need to plumb the "is SEV guest"
> info around.
>
> And FWIW, my years-old idea of using a software-available SPTE bit to track pinned
> pages is plagued by the same underlying issue: KVM's current management (or lack
> thereof) of SEV guest memory just isn't viable long term.  In all honesty, it
> probably should never have been merged.  We can't change the past, but we can,
> and IMO should, avoid piling on more code to an approach that is fundamentally flawed.
>
> [*] https://lore.kernel.org/all/20220310140911.50924-1-chao.p.peng@xxxxxxxxxxxxxxx

Thanks for the valuable feedback.

Regards
Nikunj