Re: [PATCH v4 0/9] KVM/s390: Hugetlbfs enablement

On 05.07.2018 15:01, Janosch Frank wrote:
> On 05.07.2018 14:47, David Hildenbrand wrote:
>> On 05.07.2018 14:27, Janosch Frank wrote:
>>> On 05.07.2018 09:26, David Hildenbrand wrote:
>>>> On 27.06.2018 15:55, Janosch Frank wrote:
>>>>> So, after the userfaultfd fix, postcopy now works, but vSIE in
>>>>> combination with paging can still result in crashing g3s. Therefore we
>>>>> split up the series and only integrate non-vSIE support for now.
>>>>>
>>>>> Branch:
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git hlp_vsie
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git/log/?h=hlp_vsie
>>>>>
>>>>
>>>> So, after some off-list discussions, here is my view on things:
>>>>
>>>> 1. No dirty tracking on user space tables is necessary. All KVM-internal
>>>> guest memory accesses have to set dirty bits manually (and we already seem
>>>> to do that well). All user space (QEMU) accesses have to do that on their
>>>> own (also usually already done, but we might miss some corner cases).
>>>>
>>>> 2. Splitting huge PMDs is not necessary without vSIE; it is only a
>>>> performance improvement for migration (dirty tracking at 4k vs. 1M
>>>> granularity).
>>>
>>> I feel uncomfortable without splitting. As soon as a notifier is set or
>>> memory is protected in a range smaller than 1M, we should split. The
>>> notifications should be at the smallest possible granularity, and 4k
>>> gives us the most flexibility and space for a lot of notifier bits in
>>> the PGSTE.
>>>
>>> I don't want to have IN (invalidation notification) bits on both ptes
>>> and pmds.
>>
>> As I said, we only have to split if we really protect. Notifying can be
>> handled on PMDs. It shouldn't really be that complicated to have two
>> bits and do the invalidation.
> 
> Yes, we don't have to, but I think it's somewhat cleaner.
> 
> It's not complicated, but for vSIE, where we do split, we would have IN on
> ptes and on pmds, and I don't want two possible places for IN notification.
> Or we notify IN on the pmd and then walk the ptes to notify the vSIE
> tables...

I agree that split PMDs should only have notifiers on PTEs. But that
should be easy to enforce.

But we can talk about that once we have the basic machinery without
splits running.
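
Just to illustrate what I have in mind, here is a toy userspace model
(everything here - names, flags, layout - is made up for illustration
and has nothing to do with the real gmap code):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define PMD_SIZE	(1UL << 20)	/* 1M segment */
#define PTE_SIZE	(1UL << 12)	/* 4k page */
#define PTES_PER_PMD	(PMD_SIZE / PTE_SIZE)

#define NOTIFY_BIT	0x1UL		/* "IN"-style software bit */

struct toy_pmd {
	uint64_t flags;			/* huge mapping + software bits */
	uint64_t *ptes;			/* non-NULL once the PMD is split */
};

/* Split a huge PMD into a PTE table for sub-1M tracking. */
static void toy_split_pmd(struct toy_pmd *pmd)
{
	if (!pmd->ptes)
		pmd->ptes = calloc(PTES_PER_PMD, sizeof(*pmd->ptes));
}

/*
 * Set notification for a range inside one PMD: stay on the PMD as long
 * as the whole 1M segment is covered, split and mark the individual
 * PTEs only for sub-PMD protection.
 */
static void toy_protect_range(struct toy_pmd *pmd, unsigned long start,
			      unsigned long len)
{
	unsigned long i;

	if (start == 0 && len == PMD_SIZE && !pmd->ptes) {
		pmd->flags |= NOTIFY_BIT;	/* PMD-level notification */
		return;
	}

	toy_split_pmd(pmd);			/* sub-PMD protection */
	for (i = start / PTE_SIZE; i < (start + len) / PTE_SIZE; i++)
		pmd->ptes[i] |= NOTIFY_BIT;	/* PTE-level notification */
}

int main(void)
{
	struct toy_pmd pmd = {};

	toy_protect_range(&pmd, 0, PMD_SIZE);		/* full segment */
	printf("split after full-PMD protect: %d\n", pmd.ptes != NULL);

	toy_protect_range(&pmd, 0, 2 * PTE_SIZE);	/* e.g. prefix pages */
	printf("split after sub-PMD protect:  %d\n", pmd.ptes != NULL);

	free(pmd.ptes);
	return 0;
}

The real thing would of course have to hook into the gmap protection
code (gmap_protect_range() etc.) and handle locking, but the control
flow I am after is exactly this: notify on the PMD for full-segment
protection, split only when somebody (vSIE) asks for sub-segment
protection.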

> 
>>
>>>
>>>>
>>>> 3. PGSTEs for split PMDs could be avoided, just like for shadow GMAPs; we
>>>> only have to care about locking and the two notifier bits.
>>>
>>> Theoretically we can, until vSIE comes along and we need PGSTE_VSIE
>>> and PGSTE_IN. So we actually do need them.
>>
>> You can reuse whatever software bits we have for PTEs for that. These
>> are GMAP-only tables.
> 
> I could, but splitting is horrible enough already; I want to keep split
> ptes as close to the normal ones as possible.

I agree that this can easily be changed later. You can stick to PGSTEs
for now if it makes your life easier :) In the long term we might want
to switch to simple page tables, especially to reduce space consumption.
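
For reference (if I have the numbers right): a page table for one 1M
segment is 256 entries of 8 bytes, i.e. 2k, and the PGSTEs take another
2k in the same 4k page. So with PGSTEs every split PMD costs 4k instead
of 2k of table memory - very roughly 2M of extra overhead per GB of
guest memory once everything is split.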
[...]

>>
>>>
>>>>
>>>> b) mprotect/protect_range: As long as we have no restricted protection
>>>> (e.g. the prefix only sets a notifier bit), there is no need to split; you
>>>> can use a notifier bit on the PMD. Once we actually want to restrict
>>>> access (sub-PMD protection, e.g. for gmap shadow - we should initially
>>>> forbid this on huge PMDs), we have to split. This means that without vSIE
>>>> we never have to split, except for a).
>>>
>>> See 2.
>>>
>>>>
>>>> The nice thing about this approach (if I didn't miss anything) is that
>>>> PMDs with a guest prefix on them are only split if dirty tracking is
>>>> active (or vSIE requires it via mprotect()) - in contrast to this
>>>> series. And in contrast to this series, *all* memory will be dirty
>>>> tracked using 4k instead of 1MB pages, which makes a lot of sense - even
>>>> a single CPU can otherwise easily dirty a lot of 1MB pages during
>>>> migration.
>>>
>>> Is this a real concern that actually breaks migration?
>>>
>>
>> Well, the reason for splits without vSIE from your (and Christian's) side
>> was, as far as I understood:
>>
>> "prefix is always dirty, therefore we mark too many 1MB pages dirty".
>>
>> My point is that even one CPU can easily dirty so many 1MB pages during
>> one migration iteration that you run into the exact same problems. There
>> seems to be a good reason why x86 does it that way.
>>
> 
> Let's trust in X, what could go wrong? :)

Unfortunately we cannot trust in Z here :)


-- 

Thanks,

David / dhildenb


