Re: [PATCH v4 0/9] KVM/s390: Hugetlbfs enablement


On 05.07.2018 14:27, Janosch Frank wrote:
> On 05.07.2018 09:26, David Hildenbrand wrote:
>> On 27.06.2018 15:55, Janosch Frank wrote:
>>> So, after the userfaultfd fix postcopy does work now, but vSIE in
>>> combination with paging can still result in crashing g3s. Therefore we
>>> split up the series and only integrate non-vSIE support for now.
>>>
>>> Branch:
>>> git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git hlp_vsie
>>> https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git/log/?h=hlp_vsie
>>>
>>
>> So, after some off-list discussions, here is my view on things:
>>
>> 1. No dirty tracking on user space tables is necessary. All KVM internal
>> guest memory accesses have to manually set dirty bits (and we seem to do
>> very well already). All user space (QEMU) accesses have to do that on
>> their own (also usually already done but we might miss some corner cases).
>>
>> 2. Splitting of huge PMDs is not necessary without vSIE and only a
>> performance improvement for migration ("dirty 4k vs 1m").
> 
> I feel uncomfortable without them. As soon as a notifier is set or
> memory is protected in a range smaller than 1m, we should split. The
> notifications should be on the smallest possible granularity and 4k
> gives us the most flexibility and space for a lot of notifier bits in
> the PGSTE.
> 
> I don't want to have IN on ptes and pmds.

As I said, we only have to split if we really protect. Notifying can be
handled on PMDs. It shouldn't really be that complicated to have two
bits and do the invalidation.

> 
>>
>> 3. PGSTE for split PMDs could be avoided just like for shadow GMAPs, we
>> only have to care about locking at the two notifier bits.
> 
> Theoretically we can until VSIE comes along and then we need PGSTE_VSIE
> and PGSTE_IN. So we actually need them.

You can reuse whatever software bits we have for PTEs for that. These
are GMAP-only tables.

> 
>>
>> So in summary, I would really like to see a v1 without split of PMDs and
>> a v2 with lazy splitting of all PMDs. This might require some work, but
>> it should be fairly easy to get v1 implemented (in contrast to the
>> splitting magic). v3 can then take care of vSIE. If I am not completely
>> wrong, even THP could be handled fairly easy in a v1 already.
> 
> I just factored out splitting and re-added IN for segments in an extra
> patch, so we actually see what the changes are before I squash it. We
> should be absolutely sure that we want it this way, because I'm
> absolutely not gonna rewrite it again.

Better make a copy ;)

> 
> Even this "simple" change completely broke huge page support, possibly
> because I forgot one single line somewhere.

This means that

1. We are forgetting something important, which would be bad.
2. Your current code is too complicated. :P

> 
>>
>> Especially v2 can handle it similar to x86:
>>
>> a) dirty tracking: when enabling, mark all PMDs in the GMAP as R/O. All
>> new *read* GMAP faults can map PMDs. All new GMAP *write* faults will
>> only create split PMDs and do this lazily (replacing protected PMDs by
>> split PMDs). So after a write fault, only one new 4k page will be mapped
>> writable and marked as dirty. We can detect if dirty tracking is turned
>> on/off using the kvm slots. We might be able to simply make global
>> decisions to simplify the implementation "one slot is set to do dirty
>> tracking, do it for all".
> 
> Not a big fan of that, it means that for migration I suddenly need more
> kernel memory.

We do that already for the dirty bitmap. And if x86 does it similarly,
this should not be that problematic.

> 
>>
>> b) mprotect/protect_range: As long as we have no restricted protection
>> (e.g. prefix only sets notifier bit) there is no need to split. You can
>> use a notifier bit on the PMD. Once we actually want to restrict access
>> (sub PMD protection for e.g. gmap shadow - we should forbid this
>> initially on huge PMDs), we have to split. This means, without vSIE we
>> never have to split except for a).
> 
> See 2.
> 
>>
>> The nice thing about this approach (if I didn't miss anything) is that
>> the PMDs with a guest prefix on them are only split if dirty tracking is
>> active (or vSIE requires it via mprotect()) - in contrast to this
>> series. And in contrast to this series, *all* memory will be dirty
>> tracked using 4k instead of 1MB pages, which makes a lot of sense - even
>> 1 CPU can otherwise set so many 1MB pages dirty easily during migration.
> 
> Is this a real concern that actually breaks migration?
> 

Well, as far as I understood, the reason for splits without vSIE from
your (and Christian's) side was:

"prefix is always dirty, therefore we mark too many 1MB pages dirty".

My point is that even one CPU can easily mark so many 1MB pages dirty
during a single migration iteration that you run into the exact same
problem. There seems to be a good reason why x86 does it that way.

-- 

Thanks,

David / dhildenb


