On 05.07.2018 14:47, David Hildenbrand wrote:
> On 05.07.2018 14:27, Janosch Frank wrote:
>> On 05.07.2018 09:26, David Hildenbrand wrote:
>>> On 27.06.2018 15:55, Janosch Frank wrote:
>>>> So, after the userfaultfd fix, postcopy does work now, but vSIE in combination with paging can still result in crashing g3s. Therefore we split up the series and only integrate non-vSIE support for now.
>>>>
>>>> Branch:
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git hlp_vsie
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git/log/?h=hlp_vsie
>>>
>>> So, after some off-list discussions, my view on things:
>>>
>>> 1. No dirty tracking on user space tables is necessary. All KVM-internal guest memory accesses have to manually set dirty bits (and we seem to do that very well already). All user space (QEMU) accesses have to do that on their own (also usually already done, but we might miss some corner cases).
>>>
>>> 2. Splitting of huge PMDs is not necessary without vSIE and is only a performance improvement for migration ("dirty 4k vs 1m").
>>
>> I feel uncomfortable without them. As soon as a notifier is set or memory is protected in a range smaller than 1m, we should split. The notifications should be on the smallest possible granularity, and 4k gives us the most flexibility and space for a lot of notifier bits in the PGSTE.
>>
>> I don't want to have IN on ptes and pmds.
>
> As I said, we only have to split if we really protect. Notifying can be handled on PMDs. It shouldn't really be that complicated to have two bits and do the invalidation.

Yes, we don't have to, but I think it's somewhat cleaner. It's not complicated, but for vSIE, where we split, we would have IN on ptes and on pmds, and I don't want two possible places for the IN notification. Or notifying IN on the pmd and then going to the ptes for notifying vSIE tables...

>>> 3. PGSTE for split PMDs could be avoided, just like for shadow GMAPs; we only have to care about locking for the two notifier bits.
>>
>> Theoretically we can, until vSIE comes along and we need PGSTE_VSIE and PGSTE_IN. So we actually need them.
>
> You can reuse whatever software bits we have for PTEs for that. These are GMAP-only tables.

I could, but splitting is horrible enough; I want to keep split ptes as close to the normal ones as possible.

>>> So in summary, I would really like to see a v1 without splitting of PMDs and a v2 with lazy splitting of all PMDs. This might require some work, but it should be fairly easy to get v1 implemented (in contrast to the splitting magic). v3 can then take care of vSIE. If I am not completely wrong, even THP could be handled fairly easily in a v1 already.
>>
>> I just factored out splitting and re-added IN for segments in an extra patch, so we actually see what the changes are before I squash it. We should be absolutely sure that we want it this way, because I'm absolutely not gonna rewrite it again.
>
> Better make a copy ;)

I did.

>> Even this "simple" change completely broke huge page support, possibly because I forgot one single line somewhere.
>
> This means that
>
> 1. We are forgetting something important, which would be bad.
> 2. Your current code is too complicated. :P
>
>>> Especially v2 can handle it similarly to x86:
>>>
>>> a) dirty tracking: when enabling, mark all PMDs in the GMAP as R/O. All new *read* GMAP faults can map PMDs. All new GMAP *write* faults will only create split PMDs, and do this lazily (replacing protected PMDs by split PMDs). So after a write fault, only one new 4k page will be mapped writable and marked as dirty. We can detect whether dirty tracking is turned on/off using the kvm slots. We might be able to simply make global decisions to simplify the implementation: "one slot is set to do dirty tracking, do it for all".
>>
>> Not a big fan of that; it means that for migration I suddenly need more kernel memory.
>
> We do that already for the dirty bitmap. And if x86 does it similarly, this should not be that problematic.

Alright, since you and Christian want it, let's do that.

>>> b) mprotect/protect_range: As long as we have no restricted protection (e.g. the prefix only sets a notifier bit), there is no need to split. You can use a notifier bit on the PMD. Once we actually want to restrict access (sub-PMD protection, e.g. for gmap shadow - we should forbid this initially on huge PMDs), we have to split. This means that without vSIE we never have to split, except for a).
>>
>> See 2.
>>
>>> The nice thing about this approach (if I didn't miss anything) is that the PMDs with a guest prefix on them are only split if dirty tracking is active (or vSIE requires it via mprotect()) - in contrast to this series. And in contrast to this series, *all* memory will be dirty-tracked using 4k instead of 1MB pages, which makes a lot of sense - even one CPU can otherwise easily set many 1MB pages dirty during migration.
>>
>> Is this a real concern that actually breaks migration?
>
> Well, the reason for splits without vSIE from your (and Christian's) side was, as far as I understood:
>
> "the prefix is always dirty, therefore we mark too many 1MB pages dirty".
>
> My point is that even one CPU can simply mark so many 1MB pages dirty during one migration iteration that you run into the exact same problems. There seems to be a good reason why x86 does it that way.

Let's trust in X, what could go wrong? :)
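
For illustration, here is a minimal userspace C model of the lazy splitting described in point a) above. It is only a sketch of the idea, not kernel code; the types and names (struct segment, write_fault, PAGES_PER_SEG) are invented for the example. With dirty tracking off, a write dirties the whole 1M segment; with tracking on, the first write lazily splits the segment so that only the faulting 4k page is marked dirty.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PAGES_PER_SEG 256	/* 1M segment / 4k pages */

struct segment {
	bool split;			/* has the 1M entry been split into 4k PTEs? */
	bool seg_dirty;			/* dirty bit of the 1M entry */
	bool pte_dirty[PAGES_PER_SEG];	/* per-4k dirty bits once split */
};

/* Write fault: with tracking enabled, split lazily and dirty only the
 * faulting 4k page instead of the whole 1M segment. */
static void write_fault(struct segment *seg, unsigned int page, bool tracking)
{
	if (!tracking) {
		seg->seg_dirty = true;		/* whole 1M marked dirty */
		return;
	}
	if (!seg->split) {			/* lazy split on first write */
		seg->split = true;
		memset(seg->pte_dirty, 0, sizeof(seg->pte_dirty));
	}
	seg->pte_dirty[page] = true;		/* only this 4k page dirty */
}

int main(void)
{
	struct segment seg = { 0 };

	write_fault(&seg, 7, true);
	printf("split=%d, page 7 dirty=%d, page 8 dirty=%d\n",
	       seg.split, seg.pte_dirty[7], seg.pte_dirty[8]);
	return 0;
}

After the single write, only one of the 256 per-4k dirty bits is set, which is the "dirty 4k vs 1m" gain the thread is arguing about.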
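
The rule from point b) can be sketched in the same toy style: setting a notifier bit alone can stay on the huge PMD, while actually restricting access to a sub-PMD range forces a split. Again this is not kernel code, and all names (struct seg, protect_range, in_bit) are invented for the example.

#include <stdbool.h>
#include <stdio.h>

struct seg {
	bool split;	/* has the 1M entry been split into 4k PTEs? */
	bool in_bit;	/* invalidation notifier (IN) bit on the PMD */
};

static void protect_range(struct seg *s, bool restrict_access)
{
	if (!restrict_access) {
		s->in_bit = true;	/* notification only: IN bit on the PMD suffices */
		return;
	}
	/* restricted sub-PMD protection (e.g. gmap shadow): must split */
	s->split = true;
	s->in_bit = true;
}

int main(void)
{
	struct seg prefix = { 0 }, shadow = { 0 };

	protect_range(&prefix, false);	/* guest prefix: notifier bit only */
	protect_range(&shadow, true);	/* vSIE shadow table: split */
	printf("prefix: split=%d in=%d, shadow: split=%d in=%d\n",
	       prefix.split, prefix.in_bit, shadow.split, shadow.in_bit);
	return 0;
}

This matches the summary in the thread: without vSIE, the only reason left to split is the lazy dirty-tracking path from a).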