On 08/04/2018 03:15 AM, Michael S. Tsirkin wrote:
On Fri, Aug 03, 2018 at 04:32:26PM +0800, Wei Wang wrote:
The OOM notifier is getting deprecated to use for the reasons:
- As a callout from the oom context, it is too subtle and easy to
generate bugs and corner cases which are hard to track;
- It is called too late (after the reclaiming has been performed).
Drivers with large amuont of reclaimable memory is expected to
release them at an early stage of memory pressure;
- The notifier callback isn't aware of oom contrains;
Link: https://lkml.org/lkml/2018/7/12/314
This patch replaces the virtio-balloon oom notifier with a shrinker
to release balloon pages on memory pressure. The balloon pages are
given back to mm adaptively by returning the number of pages that the
reclaimer is asking for (i.e. sc->nr_to_scan).
Currently the max possible value of sc->nr_to_scan passed to the balloon
shrinker is SHRINK_BATCH, which is 128. This is smaller than the
limitation that only VIRTIO_BALLOON_ARRAY_PFNS_MAX (256) pages can be
returned via one invocation of leak_balloon. But this patch still
considers the case that SHRINK_BATCH or shrinker->batch could be changed
to a value larger than VIRTIO_BALLOON_ARRAY_PFNS_MAX, which will need to
do multiple invocations of leak_balloon.
Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has been used
to release balloon pages on OOM. We continue to use this feature bit for
the shrinker, so the shrinker is only registered when this feature bit
has been negotiated with host.
Signed-off-by: Wei Wang <wei.w.wang@xxxxxxxxx>
Cc: Michael S. Tsirkin <mst@xxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Could you add data at how was this tested and how did guest
behaviour change. Which configurations see an improvement?
Yes. Please see the differences from the "*1" and "*2" cases below.
Taking this chance, I use "*2" and "*3" to show Michal etc the
differences of applying and not applying the shrinker fix patch here:
https://lkml.org/lkml/2018/8/3/384
*1. V3 patches
1)After inflating some amount of memory, actual=1000001536 Bytes
free -m
total used free shared buff/cache
available
Mem: 7975 7289 514 10 171 447
Swap: 10236 0 10236
2) dd if=478MB_file of=/dev/null, actual=1058721792 Bytes
free -m
total used free shared buff/cache
available
Mem: 7975 7233 102 10 639 475
Swap: 10236 0 10236
The advantage is that the inflated pages are given back to mm based on
the number, i.e. ~56MB(diff "actual" above) of the reclaimer is asking
for. This is more adaptive.
*2. V2 paches, balloon_pages_to_shrink=1000000 pages (around 4GB), with
the shrinker fix patches applied.
1)After inflating some amount of memory, actual=1000001536 Bytes
free -m
total used free shared buff/cache
available
Mem: 7975 7288 530 10 157 455
Swap: 10236 0 10236
2)dd if=478MB_file of=/dev/null, actual=5096001536 Bytes
free -m
total used free shared buff/cache
available
Mem: 7975 3381 3953 10 640 4327
Swap: 10236 0 10236
In the above example, we set 4GB to shrink to make the difference
obvious. Though the claimer only needs to reclaim ~56MB memory, 4GB
inflated pages are given back to mm, which is unnecessary. From the
user's perspective, it has no idea of how many pages to given back at
the time of setting the module parameter (balloon_pages_to_shrink). So I
think the above "*1" is better.
*3. V2 paches, balloon_pages_to_shrink=1000000 pages (around 4GB),
without the shrinker fix patches applied.
1) After inflating some amount of memory, actual=1000001536 Bytes
free -m
total used free shared buff/cache
available
Mem: 7975 7292 524 10 158 450
Swap: 10236 0 10236
2) dd if=478MB_file of=/dev/null, actual=8589934592 Bytes
free -m
total used free shared buff/cache
available
Mem: 7975 53 7281 10 640 7656
Swap: 10236 0 10236
Compared to *2, all the balloon pages are shrunk, but users expect 4GB
to shrink. The reason is that do_slab_shrink has a mistake in
calculating schrinkctl->nr_scanned, which should be the actual number of
pages that the shrinker has freed, but do slab_shrink still treat that
value as 128 (but 4GB has actually been freed).
Best,
Wei