Re: [RFC 09/37] KVM: s390: protvirt: Implement on-demand pinning

Christian Borntraeger <borntraeger@xxxxxxxxxx> · Mon, 4 Nov 2019 14:58:02 +0100

On 04.11.19 11:19, David Hildenbrand wrote:
>>>> to synchronize page import/export with the I/O for paging. For example you can actually
>>>> fault in a page that is currently under paging I/O. What do you do? import (so that the
>>>> guest can run) or export (so that the I/O will work). As this turned out to be harder than
>>>> we thought we decided to defer paging to a later point in time.
>>>
>>> I don't quite see the issue yet. If you page out, the page will
>>> automatically (on access) be converted to !secure/encrypted memory. If
>>> the UV/guest wants to access it, it will be automatically converted to
>>> secure/unencrypted memory. If you have concurrent access, it will be
>>> converted back and forth until one party is done.
>>
>> IO does not trigger an export on an imported page, but an error
>> condition in the IO subsystem. The page code does not read pages through
> 
> Ah, that makes it much clearer. Thanks!
> 
>> the cpu, but often just asks the device to read directly and that's
>> where everything goes wrong. We could bounce swapping, but chose to pin
>> for now until we find a proper solution to that problem which nicely
>> integrates into linux.
> 
> How hard would it be to
> 
> 1. Detect the error condition
> 2. Try a read on the affected page from the CPU (which will automatically convert to encrypted/!secure)
> 3. Restart the I/O
> 
> I assume that this is a corner case where we don't really have to care about performance in the first shot.

We have looked into this. You would need to implement this in the low-level
handler for every I/O: DASD, FCP, PCI-based NVMe, iSCSI. Where do you want
to stop?
There is no proper global bounce buffer that works for everything.

>>>
>>> A proper automatic conversion should make this work. What am I missing?
>>>
>>>>
>>>> As we do not want to rely on the userspace to do the mlock this is now done in the kernel.
>>>
>>> I wonder if we could come up with an alternative (similar to how we
>>> override VM_MERGEABLE in the kernel) that can be called and ensured in
>>> the kernel. E.g., marking whole VMAs as "don't page" (I remember
>>> something like "special VMAs" like used for VDSOs that achieve exactly
>>> that, but I am absolutely no expert on that). That would be much nicer
>>> than pinning all pages and remembering what you pinned in huge page
>>> arrays ...
>>
>> It might be more worthwhile to just accept one or two releases with
>> pinning and fix the root of the problem than design a nice stopgap.
> 
> Quite honestly, to me this feels like a prototype hack that deserves a proper solution first. The issue with this hack is that it affects user space (esp. MADV_DONTNEED no longer working correctly). It's not just something you once fix in the kernel and be done with it.

I disagree. Pinning is a valid initial version. I would find it strange to
allow it for AMD SEV but not for s390x.
As far as I can tell, MADV_DONTNEED continues to work within the bounds
of the specification. In fact, it does work (or does not, depending on your
perspective :-) ) exactly in the same way as on hugetlbfs, which is also
a way of pinning.

And yes, I am in full agreement that we must work on lifting that
restriction. 

>>
>> Btw. s390 is not alone with the problem and we'll try to have another
>> discussion tomorrow with AMD to find a solution which works for more
>> than one architecture.
> 
> Let me know if there was an interesting outcome.