Re: [PATCH v4 0/9] KVM/s390: Hugetlbfs enablement

On 05.07.2018 09:26, David Hildenbrand wrote:
> On 27.06.2018 15:55, Janosch Frank wrote:
>> So, after the userfaultfd fix postcopy does work now, but vSIE in
>> combination with paging can still result in crashing g3s. Therefore we
>> split up the series and only integrate non-vSIE support for now.
>>
>> Branch:
>> git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git hlp_vsie
>> https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git/log/?h=hlp_vsie
>>
> 
> So, after some off-lists discussions my view on things
> 
> 1. No dirty tracking on user space tables is necessary. All KVM internal
> guest memory accesses have to manually set dirty bits (and we seem to do
> very well already). All user space (QEMU) accesses have to do that on
> their own (also usually already done but we might miss some corner cases).
> 
> 2. Splitting of huge PMDs is not necessary without vSIE and only a
> performance improvement for migration ("dirty 4k vs 1m").

I feel uncomfortable without splitting. As soon as a notifier is set or
memory is protected in a range smaller than 1M, we should split. The
notifications should be at the smallest possible granularity; 4k gives
us the most flexibility and room for a lot of notifier bits in the
PGSTE.
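
To make that rule concrete, here is a minimal userspace sketch of the
split decision (helper name and layout invented for illustration, not
the actual gmap code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HPAGE_SIZE (1UL << 20)	/* 1M segment */

/* Invented helper: split a huge PMD whenever a notifier is requested
 * or the protected range is not a whole multiple of the 1M segment,
 * i.e. whenever 1M granularity cannot express the request. */
static bool needs_split(uint64_t start, uint64_t len, bool notifier_set)
{
	if (notifier_set)
		return true;	/* notifiers always go down to 4k granularity */
	return (start % HPAGE_SIZE) != 0 || (len % HPAGE_SIZE) != 0;
}
```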

I don't want to have IN (invalidation notification) bits on both ptes
and pmds.

> 
> 3. PGSTE for split PMDs could be avoided just like for shadow GMAPs, we
> only have to care about locking at the two notifier bits.

Theoretically we can, until vSIE comes along and we need PGSTE_VSIE
and PGSTE_IN. So we actually do need them.
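
To illustrate why the two bits have to stay separate (bit values made
up for this sketch; the real PGSTE bit definitions live in the s390
pgtable headers):

```c
#include <assert.h>

/* Illustrative only: two independent notifier bits per PGSTE let an
 * invalidation distinguish a plain IN consumer from a vSIE shadow. */
#define PGSTE_IN	0x1u	/* invalidation notification requested */
#define PGSTE_VSIE	0x2u	/* pte backs a vSIE shadow table */

static unsigned int notify_consumers(unsigned int pgste)
{
	unsigned int notified = 0;

	if (pgste & PGSTE_IN)
		notified |= PGSTE_IN;	/* fire the generic notifier */
	if (pgste & PGSTE_VSIE)
		notified |= PGSTE_VSIE;	/* also tear down the shadow mapping */
	return notified;
}
```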

> 
> So in summary, I would really like to see a v1 without split of PMDs and
> a v2 with lazy splitting of all PMDs. This might require some work, but
> it should be fairly easy to get v1 implemented (in contrast to the
> splitting magic). v3 can then take care of vSIE. If I am not completely
> wrong, even THP could be handled fairly easy in a v1 already.

I just factored out splitting and re-added IN for segments in an extra
patch, so we can actually see what the changes are before I squash it.
We should be absolutely sure that we want it this way, because I'm
absolutely not going to rewrite it again.

Even this "simple" change completely broke huge page support, possibly
because I forgot one single line somewhere.

> 
> Especially v2 can handle it similar to x86:
> 
> a) dirty tracking: when enabling, mark all PMDs in the GMAP as R/O. All
> new *read* GMAP faults can map PMDs. All new GMAP *write* faults will
> only create split PMDs and do this lazily (replacing protected PMDs by
> split PMDs). So after a write fault, only one new 4k page will be mapped
> writable and marked as dirty. We can detect if dirty tracking is turned
> on/off using the kvm slots. We might be able to simply make global
> decisions to simplify the implementation "one slot is set to do dirty
> tracking, do it for all".
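
To spell out the lazy scheme described above as a userspace model (all
structure and helper names invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

enum pmd_state { PMD_HUGE_RW, PMD_HUGE_RO, PMD_SPLIT };

struct seg {
	enum pmd_state state;
	bool dirty_4k[256];	/* 256 x 4k pages per 1M segment */
};

/* Enabling dirty tracking write-protects every huge PMD. */
static void enable_dirty_tracking(struct seg *s)
{
	if (s->state == PMD_HUGE_RW)
		s->state = PMD_HUGE_RO;
}

/* Read faults keep the huge R/O mapping; write faults split lazily
 * and dirty only the single 4k page that was actually touched. */
static void model_fault(struct seg *s, unsigned int pgidx, bool write)
{
	if (!write)
		return;
	if (s->state == PMD_HUGE_RO)
		s->state = PMD_SPLIT;	/* replace by 4k mappings */
	s->dirty_4k[pgidx] = true;
}

static int demo(void)
{
	struct seg s = { .state = PMD_HUGE_RW };

	enable_dirty_tracking(&s);
	model_fault(&s, 3, true);	/* one write: split + one dirty page */
	return s.state == PMD_SPLIT && s.dirty_4k[3] && !s.dirty_4k[4];
}
```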

I'm not a big fan of that: it means that for migration I suddenly need
more kernel memory.

> 
> b) mprotect/protect_range: As long as we have no restricted protection
> (e.g. prefix only sets notifier bit) there is no need to split. You can
> use a notifier bit on the PMD. Once we actually want to restrict access
> (sub PMD protection for e.g. gmap shadow - we should forbid this
> initially on huge PMDs), we have to split. This means, without vSIE we
> never have to split except for a).

See 2.

> 
> The nice thing about this approach (if I didn't miss anything) is that
> the PMDs with a guest prefix on them are only split if dirty tracking is
> active (or vSIE requires it via mprotect()) - in contrast to this
> series. And in contrast to this series, *all* memory will be dirty
> tracked using 4k instead of 1MB pages, which makes a lot of sense - even
> 1 CPU can otherwise set so many 1MB pages dirty easily during migration.

Is this a real concern that actually breaks migration?

> 
>>
>> V4:
>> 	* Split up vSIE patches
>> 	* Added hpage module parameter disabling vSIE when set.
>> 	* Added HPAGE capability that has to be enabled for huge guests
>>
>> V3:
>> 	* Moved splitting to the front.
>> 	* Cleanups
>>
>> V2:
>> 	* Incorporated changes from David's cleanup
>> 	* Now flushing with IDTE_NODAT for protection transfers.
>> 	* Added RRBE huge page handling for g2 -> g3 skey emulation
>> 	* Added documentation for capability
>> 	* Renamed GMAP_ENTRY_* constants
>> 	* Added SEGMENT hardware bits constants
>> 	* Improved some patch descriptions
>> 	* General small improvements
>> 	* Introduced pte_from_pmd function
>>
>> Dominik Dingel (2):
>>   s390/mm: clear huge page storage keys on enable_skey
>>   s390/mm: hugetlb pages within a gmap can not be freed
>>
>> Janosch Frank (7):
>>   s390/mm: Abstract gmap notify bit setting
>>   s390/mm: Introduce gmap_pmdp_xchg
>>   s390/mm: add gmap pmd invalidation notification
>>   s390/mm: Add gmap pmd invalidation and clearing
>>   s390/mm: Add huge page dirty sync support
>>   s390/mm: Add huge pmd storage key handling
>>   s390/mm: Enable gmap huge pmd support
>>
>>  Documentation/virtual/kvm/api.txt   |  16 +
>>  arch/s390/include/asm/gmap.h        |  23 ++
>>  arch/s390/include/asm/mmu.h         |   2 +
>>  arch/s390/include/asm/mmu_context.h |   1 +
>>  arch/s390/include/asm/pgtable.h     |  17 +-
>>  arch/s390/kvm/kvm-s390.c            |  57 +++-
>>  arch/s390/mm/gmap.c                 | 612 ++++++++++++++++++++++++++++++++++--
>>  arch/s390/mm/pageattr.c             |   6 +-
>>  arch/s390/mm/pgtable.c              | 182 ++++++++++-
>>  include/uapi/linux/kvm.h            |   1 +
>>  10 files changed, 855 insertions(+), 62 deletions(-)
>>
> 
> 



