Re: [RFC 09/37] KVM: s390: protvirt: Implement on-demand pinning

Janosch Frank <frankja@xxxxxxxxxxxxx> · Mon, 4 Nov 2019 11:25:04 +0100

On 11/4/19 11:19 AM, David Hildenbrand wrote:
>>>> to synchronize page import/export with the I/O for paging. For example you can actually
>>>> fault in a page that is currently under paging I/O. What do you do? import (so that the
>>>> guest can run) or export (so that the I/O will work). As this turned out to be harder than
>>>> we thought, we decided to defer paging to a later point in time.
>>>
>>> I don't quite see the issue yet. If you page out, the page will
>>> automatically (on access) be converted to !secure/encrypted memory. If
>>> the UV/guest wants to access it, it will be automatically converted to
>>> secure/unencrypted memory. If you have concurrent access, it will be
>>> converted back and forth until one party is done.
>>
>> IO does not trigger an export on an imported page, but an error
>> condition in the IO subsystem. The page code does not read pages through
> 
> Ah, that makes it much clearer. Thanks!
> 
>> the cpu, but often just asks the device to read directly and that's
>> where everything goes wrong. We could bounce swapping, but chose to pin
>> for now until we find a proper solution to that problem which nicely
>> integrates into linux.
> 
> How hard would it be to
> 
> 1. Detect the error condition
> 2. Try a read on the affected page from the CPU (which will automatically
> convert to encrypted/!secure)
> 3. Restart the I/O
> 
> I assume that this is a corner case where we don't really have to care 
> about performance in the first shot.

Restarting IO can be quite difficult with CCW, we might need to change
request data...
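
For illustration only, the three steps proposed above can be modeled as a
toy state machine. Every name below (toy_page, toy_device_io,
toy_cpu_touch, toy_io_with_retry) is hypothetical and exists neither in
the kernel nor in the UV interface. The sketch assumes that a host CPU
access always exports the page and that the I/O can be reissued unchanged.
That second assumption is exactly what may not hold for CCW, and the model
also ignores the guest re-importing the page between the touch and the
restart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel or the UV. */
enum page_state { PAGE_SECURE, PAGE_EXPORTED };

struct toy_page {
	enum page_state state;
	unsigned int io_errors;	/* how often device I/O was rejected */
};

/*
 * Device I/O bypasses the CPU, so a secure page produces an error
 * condition instead of an automatic export (step 1: detect the error).
 */
static bool toy_device_io(struct toy_page *p)
{
	if (p->state == PAGE_SECURE) {
		p->io_errors++;
		return false;
	}
	return true;
}

/* Step 2: a host CPU access is assumed to export the page. */
static void toy_cpu_touch(struct toy_page *p)
{
	p->state = PAGE_EXPORTED;
}

/*
 * Step 3: restart the I/O after touching the page.
 * Returns the number of attempts used, or -1 once max_tries is exhausted.
 */
static int toy_io_with_retry(struct toy_page *p, int max_tries)
{
	assert(p != NULL && max_tries > 0);

	for (int attempt = 1; attempt <= max_tries; attempt++) {
		if (toy_device_io(p))
			return attempt;
		toy_cpu_touch(p);
	}
	return -1;
}
```

In this model a secure page needs two attempts: the first I/O fails, the
touch exports the page, and the restarted I/O succeeds. A real
implementation would have to bound the loop the same way, since a running
guest can import the page again at any time.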

> 
>>
>>>
>>> A proper automatic conversion should make this work. What am I missing?
>>>
>>>>
>>>> As we do not want to rely on the userspace to do the mlock this is now done in the kernel.
>>>
>>> I wonder if we could come up with an alternative (similar to how we
>>> override VM_MERGEABLE in the kernel) that can be called and ensured in
>>> the kernel. E.g., marking whole VMAs as "don't page" (I remember
>>> something like "special VMAs" like used for VDSOs that achieve exactly
>>> that, but I am absolutely no expert on that). That would be much nicer
>>> than pinning all pages and remembering what you pinned in huge page
>>> arrays ...
>>
>> It might be more worthwhile to just accept one or two releases with
>> pinning and fix the root of the problem than design a nice stopgap.
> 
> Quite honestly, to me this feels like a prototype hack that deserves a 
> proper solution first. The issue with this hack is that it affects user 
> space (esp. MADV_DONTNEED no longer working correctly). It's not just 
> something you once fix in the kernel and be done with it.

It is a hack, yes.
But we're not the only architecture to need it: x86 pins all the memory
at the start of the VM and that code is already upstream...

>>
>> Btw. s390 is not alone with the problem and we'll try to have another
>> discussion tomorrow with AMD to find a solution which works for more
>> than one architecture.
> 
> Let me know if there was an interesting outcome.
> 
