On 10.02.19 01:38, Michael S. Tsirkin wrote:
> On Fri, Feb 08, 2019 at 02:05:09PM -0800, Alexander Duyck wrote:
>> On Fri, Feb 8, 2019 at 1:38 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>>>
>>> On Fri, Feb 08, 2019 at 03:41:55PM -0500, Nitesh Narayan Lal wrote:
>>>>>> I am also planning to try Michael's suggestion of using MAX_ORDER - 1.
>>>>>> However I am still thinking about a workload which I can use to test its
>>>>>> effectiveness.
>>>>> You might want to look at doing something like min(MAX_ORDER - 1,
>>>>> HUGETLB_PAGE_ORDER). I know for x86 a 2MB page is the upper limit for
>>>>> THP which is the most likely to be used page size with the guest.
>>>> Sure, thanks for the suggestion.
>>>
>>> Given current hinting in balloon is MAX_ORDER I'd say
>>> share code. If you feel a need to adjust down the road,
>>> adjust both of them with actual testing showing gains.
>>
>> Actually I'm left kind of wondering why we are even going through
>> virtio-balloon for this?
>
> Just look at what it does.
>
> It improves memory overcommit if guests are cooperative, and it does
> this by giving the hypervisor addresses of pages which it can discard.
>
> It's just *exactly* like the balloon with all the same limitations.

I agree, this belongs in virtio-balloon *unless* we run into real
problems implementing it via an asynchronous mechanism.

>
>> It seems like this would make much more sense
>> as core functionality of KVM itself for the specific architectures
>> rather than some side thing.

Whatever can be handled in user space and does not have a significant
performance impact should be handled in user space. If we run into
real problems with that approach, fair enough. (E.g. vcpu yielding is
a good example where an implementation in KVM makes sense, not going
via QEMU.)

>
> Well same as balloon: whether it's useful to you at all
> would very much depend on your workloads.
>
> This kind of cooperative functionality is good for co-located
> single-tenant VMs. That's pretty niche. The core things in KVM
> generally don't trust guests.
>
>
>> In addition this could end up being
>> redundant when you start getting into either the s390 or PowerPC
>> architectures as they already have means of providing unused page
>> hints.

I'd like to note that on s390x the functionality is not provided when
running nested guests, and there are real problems getting it ever
supported. (See the description below of how it works on s390x; the
issue for nested guests is the bits in the guest -> host page tables,
which we cannot support for nested guests.)

Hinting only works for guests running one level under LPAR (with a
recent machine), but not for nested guests. (LPAR -> KVM1 works,
LPAR -> KVM1 -> KVM2 does not.)

So an implementation for s390x would still make sense for this
scenario.

>
> Interesting. Is there host support in kvm?

On s390x there is. It works on page granularity, and synchronization
between guest and host ("don't drop a page in the host while the guest
is reusing it") is done via special bits in the host->guest page
table. Instructions in the guest are able to modify these bits. A
guest can configure a "usage state" of its backing PTEs, e.g. "unused"
or "stable".

Whenever a page in the guest is freed/reused, the ESSA instruction is
triggered in the guest. It will modify the page table bits and add the
guest physical pfn to a buffer in the host. Once that buffer is full,
ESSA will trigger an intercept to the hypervisor. Here, all these
"unused" pages can be zapped.
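(Just to make the batching idea more concrete, here is a rough,
hypothetical user-space sketch; hint_page_free(), flush_to_hypervisor()
and the buffer size are made up for illustration, this is not the real
ESSA/KVM code.)

/*
 * Toy model of the mechanism above: freed pfns are collected in a
 * buffer, and the expensive guest->hypervisor transition only happens
 * once the buffer is full, not on every freed page.
 */
#include <stdio.h>
#include <stddef.h>

#define HINT_BUFFER_SIZE 128

struct hint_buffer {
    unsigned long pfns[HINT_BUFFER_SIZE];
    size_t count;
};

/* Stand-in for the intercept to the hypervisor. */
static void flush_to_hypervisor(struct hint_buffer *buf)
{
    printf("intercept: handing %zu free pfns to the hypervisor\n",
           buf->count);
    /* The hypervisor would zap/discard the hinted pages here. */
    buf->count = 0;
}

/* Called whenever the guest frees a page. */
static void hint_page_free(struct hint_buffer *buf, unsigned long pfn)
{
    buf->pfns[buf->count++] = pfn;    /* cheap in the common case */
    if (buf->count == HINT_BUFFER_SIZE)
        flush_to_hypervisor(buf);     /* transition only when full */
}

int main(void)
{
    struct hint_buffer buf = { .count = 0 };

    /* "Freeing" 1000 pages results in only a handful of intercepts. */
    for (unsigned long pfn = 0; pfn < 1000; pfn++)
        hint_page_free(&buf, pfn);
    if (buf.count)
        flush_to_hypervisor(&buf);
    return 0;
}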
Also, when swapping a page out in the hypervisor, if it was marked by
the guest as unused or logically zero, instead of swapping out the
page, it can simply be dropped and a fresh zero page can be supplied
when the guest tries to access it.

"ESSA" is implemented in KVM in arch/s390/kvm/priv.c:handle_essa().

So on s390x, it works because the synchronization with the hypervisor
is directly built into the hw virtualization support (guest->host page
tables + instruction) and ESSA will not intercept on every call (due
to the buffer).

>
>
>> I have a set of patches I proposed that add similar functionality via
>> a KVM hypercall for x86 instead of doing it as a part of a Virtio
>> device[1]. I'm suspecting the overhead of doing things this way is
>> much less than having to make multiple madvise system calls from QEMU
>> back into the kernel.
>
> Well whether it's a virtio device is orthogonal to whether it's a
> madvise call, right? You can build vhost-pagehint and that can
> handle requests in a VQ within balloon and do it
> within host kernel directly.
>
> virtio rings let you pass multiple pages so it's really hard to
> say which will win outright - maybe it's more important
> to coalesce exits. We don't know until we measure it.
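For completeness, the madvise path mentioned above boils down to
something like the following on the hypervisor side. (Purely a
hypothetical user-space sketch: the free_range batch is made up and no
real virtio/KVM interface is shown, just the madvise() calls QEMU
would end up issuing.)

/*
 * Discard a batch of guest-reported free ranges from the hypervisor
 * process. One madvise() call per hinted range; the open question
 * discussed above is how well these calls/exits can be coalesced.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

struct free_range {    /* guest-free range, as a host-virtual address */
    void   *hva;
    size_t  len;
};

static void discard_free_ranges(const struct free_range *ranges, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* MADV_DONTNEED drops the backing pages; MADV_FREE would be lazier. */
        if (madvise(ranges[i].hva, ranges[i].len, MADV_DONTNEED))
            perror("madvise");
    }
}

int main(void)
{
    size_t len = 2 * 1024 * 1024;    /* pretend the guest hinted a 2MB chunk */
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    struct free_range batch[] = { { mem, len } };
    discard_free_ranges(batch, 1);

    munmap(mem, len);
    return EXIT_SUCCESS;
}

--

Thanks,

David / dhildenb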