Re: [RFC PATCH] mm/page_reporting: Adjust threshold according to MAX_ORDER

Alexander Duyck <alexander.duyck@xxxxxxxxx> · Mon, 14 Jun 2021 19:26:55 -0700

On Mon, Jun 14, 2021 at 4:03 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 11.06.21 09:44, Gavin Shan wrote:
> > On 6/1/21 6:01 PM, David Hildenbrand wrote:
> >> On 01.06.21 05:33, Gavin Shan wrote:
> >>> The PAGE_REPORTING_MIN_ORDER is equal to @pageblock_order, taken as
> >>> minimal order (threshold) to trigger page reporting. The page reporting
> >>> is never triggered with the following configurations and settings on
> >>> aarch64. In this scenario, page reporting isn't triggered until
> >>> page freeing produces a free area of the largest order
> >>> (2 ^ (MAX_ORDER-1) pages). That condition is very hard, or even
> >>> impossible, to meet.
> >>>
> >>>     CONFIG_ARM64_PAGE_SHIFT:              16
> >>>     CONFIG_HUGETLB_PAGE:                  Y
> >>>     CONFIG_HUGETLB_PAGE_SIZE_VARIABLE:    N
> >>>     pageblock_order:                      13
> >>>     CONFIG_FORCE_MAX_ZONEORDER:           14
> >>>     MAX_ORDER:                            14
> >>>
> >>> The issue can be reproduced in VM, running kernel with above configurations
> >>> and settings. The 'memhog' is used inside the VM to access 512MB anonymous
> >>> area. The QEMU's RSS doesn't drop accordingly after 'memhog' exits.
> >>>
> >>>     /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
> >>>     -accel kvm -machine virt,gic-version=host                        \
> >>>     -cpu host -smp 8,sockets=2,cores=4,threads=1 -m 4096M,maxmem=64G \
> >>>     -object memory-backend-ram,id=mem0,size=2048M                    \
> >>>     -object memory-backend-ram,id=mem1,size=2048M                    \
> >>>     -numa node,nodeid=0,cpus=0-3,memdev=mem0                         \
> >>>     -numa node,nodeid=1,cpus=4-7,memdev=mem1                         \
> >>>       :                                                              \
> >>>     -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
> >>>
> >>> This tries to fix the issue by adjusting the threshold to the smaller value
> >>> of @pageblock_order and (MAX_ORDER/2). With this applied, the QEMU's RSS
> >>> drops after 'memhog' exits.
> >>
> >> IIRC, we use pageblock_order to
> >>
> >> a) Reduce the free page reporting overhead. Reporting on small chunks can make us report constantly with little system activity.
> >>
> >> b) Avoid splitting THP in the hypervisor, avoiding downgraded VM performance.
> >>
> >> c) Avoid affecting creation of pageblock_order pages while hinting is active. I think there are cases where "temporary pulling sub-pageblock pages" can negatively affect creation of pageblock_order pages. Concurrent compaction would be one of these cases.
> >>
> >> The monstrosity called aarch64 64k is really special in that sense, because a) does not apply because pageblocks are just very big, b) does sometimes not apply because either our VM isn't backed by (rare) 512MB THP or uses 4k with 2MB THP and c) similarly doesn't apply in smallish VMs because we don't really happen to create 512MB THP either way.
> >>
> >>
> >> For example, going on x86-64 from reporting 2MB to something like 32KB is absolutely undesired.
> >>
> >> I think if we want to go down that path (and I am not 100% sure yet if we want to), we really want to treat only the special case in a special way. Note that even when doing it only for aarch64 with 64k, you will still end up splitting THP in a hypervisor if it uses 64k base pages (b)) and can affect creation of THP, for example, when compacting (c), so there is a negative side to that.
> >>
> >
> > [Remove Alexander from the cc list as his mail isn't reachable]
> >
>
> [adding his gmail address which should be the right one]
>
> > David, thanks for taking the time to review, and sorry for the late
> > response. I spent some time getting familiar with the code, but I
> > still have some questions, explained below.
> >
> > Yes, @pageblock_order is currently taken as page reporting threshold. It will
> > incur more overhead if the threshold is decreased as you said in (a).
>
> Right. Alex did quite some performance/overhead evaluation when
> introducing this feature. Changing the reporting granularity on most
> setups (esp., x86-64) is not desired IMHO.

Yes, generally, reporting pages comes at a fairly high cost, so it is
important to find the right trade-off between the size of the page and
the size of the batch of pages being reported. If the size of the
pages is reduced, it may be important to increase the batch size in
order to avoid paying too much in the way of overhead.
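
As a rough illustration of that trade-off, here is a minimal sketch
that only counts how much memory one report batch covers. It assumes a
batch of 32 entries (PAGE_REPORTING_CAPACITY, as far as I recall) and
uses the aarch64 64K numbers from the patch description. The helper
names are made up for this mail and it says nothing about the real
per-report cost.

```python
# Assumed batch size: entries handed to the hypervisor per report.
PAGE_REPORTING_CAPACITY = 32


def bytes_per_batch(order, page_shift, capacity=PAGE_REPORTING_CAPACITY):
    """Memory covered by one batch of `capacity` free blocks of `order`."""
    return capacity << (order + page_shift)


def capacity_to_match(old_order, new_order, capacity=PAGE_REPORTING_CAPACITY):
    """Entries needed at new_order (<= old_order) to cover the same
    amount of memory as `capacity` entries at old_order."""
    return capacity << (old_order - new_order)


if __name__ == "__main__":
    page_shift = 16  # CONFIG_ARM64_PAGE_SHIFT from the patch description
    for order in (13, 7):  # pageblock_order, and min(13, MAX_ORDER / 2)
        mib = bytes_per_batch(order, page_shift) >> 20
        print(f"order {order}: {mib} MiB per batch")
    print("entries to match:", capacity_to_match(13, 7))
```

Under those assumptions, dropping the threshold from order 13 to order
7 shrinks what one batch covers by a factor of 64, unless the batch
grows by the same factor.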

The other main reason for holding to pageblock_order on x86 is to
avoid THP splitting. Reporting anything smaller than pageblock_order
will trigger THP splitting, which significantly hurts the performance
of the VM in general because it forces the backing memory down to
order 0 pages.
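
For reference, the block sizes under discussion follow directly from
the order and the page shift. A minimal sketch; the helper names are
hypothetical, the aarch64 values come from the patch description, and
the x86-64 values (4K pages, pageblock_order 9) are my assumption based
on the 2MB figure quoted above:

```python
def block_bytes(order, page_shift):
    """Size in bytes of a free block of the given buddy order."""
    return 1 << (order + page_shift)


def proposed_threshold(pageblock_order, max_order):
    """Threshold suggested in the RFC: min(pageblock_order, MAX_ORDER / 2)."""
    return min(pageblock_order, max_order // 2)


if __name__ == "__main__":
    # aarch64 with 64K pages: pageblock_order 13, MAX_ORDER 14
    print(block_bytes(13, 16) >> 20, "MiB at the current threshold")
    new_order = proposed_threshold(13, 14)
    print("proposed order:", new_order)
    print(block_bytes(new_order, 16) >> 20, "MiB at the proposed threshold")
    # x86-64 with 4K pages: pageblock_order 9 (assumed)
    print(block_bytes(9, 12) >> 20, "MiB at pageblock_order on x86-64")
```

On the aarch64 64K configuration the current threshold works out to
512MB, which is also the largest block the buddy allocator can hold
with MAX_ORDER 14, and the proposed threshold works out to order 7, or
8MB. On x86-64, pageblock_order corresponds to one 2MB THP.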