This means that the same address space can have memory belonging to
more than one protected guest, although only one will be running;
the others will in fact not even have any CPUs.
... this ...
this ^ is exactly the reboot case.
Ah, right, we're having more than one protected guest per process, so
it's all handled within the same process.
When a guest is destroyed, its memory still counts towards its
memory control group until it's actually freed (I tested this
experimentally).
When the system runs out of memory, if a guest has terminated and
its memory is being cleaned asynchronously, the OOM killer will
wait a little and then see if memory has been freed. This has the
practical effect of slowing down memory allocations when the system
is out of memory to give the cleanup thread time to clean up and
free memory, and avoid an actual OOM situation.
... and this sounds like the kind of arch MM hacks that will bite us
in the long run. Of course, I might be wrong, but already doing
excessive GFP_ATOMIC allocations or messing with the OOM killer that
they are GFP_ATOMIC but they should not put too much weight on the
memory and can also fail without consequences, I used:
GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN
also notice that after every page allocation a page gets freed, so this
is only temporary.
Correct me if I'm wrong: you're allocating unmovable pages for tracking
(e.g., ZONE_DMA, ZONE_NORMAL) from atomic reserves and will free a
movable process page, correct? Or which page will you be freeing?
I would not call it "messing with the OOM killer", I'm using the same
interface used by virtio-balloon
Right, and for virtio-balloon it's actually a workaround to restore the
original behavior of a rarely used feature: deflate-on-oom. Commit
da10329cb057 ("virtio-balloon: switch back to OOM handler for
VIRTIO_BALLOON_F_DEFLATE_ON_OOM") tried to document why we switched back
from a shrinker to VIRTIO_BALLOON_F_DEFLATE_ON_OOM:
"The name "deflate on OOM" makes it pretty clear when deflation should
happen - after other approaches to reclaim memory failed, not while
reclaiming. This allows to minimize the footprint of a guest - memory
will only be taken out of the balloon when really needed."
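
To make the interface under discussion concrete, here is a minimal
userspace model of an OOM notifier chain. It is only a sketch and not
kernel code: every name in it (model_register_oom_notifier,
model_out_of_memory, struct lazy_destroy, the batch field) is a
hypothetical stand-in. The only behavior it assumes from the real
interface is that each notifier adds the number of pages it released to
a counter, and that the OOM path skips killing when that counter ends up
non-zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a notifier callback: it reports
 * released pages by adding to *freed. */
typedef int (*oom_notifier_fn)(unsigned long *freed, void *ctx);

struct oom_notifier {
    oom_notifier_fn fn;
    void *ctx;
    struct oom_notifier *next;
};

/* Head of the (model) notifier chain. */
static struct oom_notifier *oom_chain;

static void model_register_oom_notifier(struct oom_notifier *nb)
{
    nb->next = oom_chain;
    oom_chain = nb;
}

/* Pages of a destroyed guest that still await asynchronous cleanup. */
struct lazy_destroy {
    unsigned long pending_pages;
    unsigned long batch; /* assumed: max pages released per OOM event */
};

/* Release up to one batch of pending pages and report them as freed. */
static int lazy_destroy_oom_notify(unsigned long *freed, void *ctx)
{
    struct lazy_destroy *ld = ctx;
    unsigned long n = ld->pending_pages < ld->batch ?
                      ld->pending_pages : ld->batch;

    ld->pending_pages -= n;
    *freed += n;
    return 0;
}

/* Returns 1 if a victim would have to be killed, 0 if the notifiers
 * released memory and the allocation can simply be retried. */
static int model_out_of_memory(void)
{
    unsigned long freed = 0;
    struct oom_notifier *nb;

    for (nb = oom_chain; nb; nb = nb->next)
        nb->fn(&freed, nb->ctx);

    return freed == 0;
}
```

In this model the notifiers only run once the allocator has already
given up on everything else, which is the ordering the commit message
above describes.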
Note some subtle differences:
a) IIRC, before running into the OOM killer, the kernel will try reclaiming
anything else. This is what we want for deflate-on-oom, it might not
be what you want for your feature (e.g., flushing other processes/VMs
to disk/swap instead of waiting for a single process to stop).
b) Migration of movable balloon inflated pages continues working because
we are dealing with non-lru page migration.
Will page reclaim, page migration, compaction, ... of these movable LRU
pages still continue working while they are sitting around waiting to be
cleaned up? I can see that we're grabbing an extra reference when we put
them onto the list, that might be a problem: for example, we can most
certainly not swap out these pages or write them back to disk on memory
pressure.
way for a pure (shutdown) optimization is an alarm signal. Of course,
I might be wrong.
You should at least CC linux-mm. I'll do that right now and also CC
Michal. He might have time to have a quick glimpse at patch #11 and
#13.
https://lkml.kernel.org/r/20210804154046.88552-12-imbrenda@xxxxxxxxxxxxx
https://lkml.kernel.org/r/20210804154046.88552-14-imbrenda@xxxxxxxxxxxxx
IMHO, we should proceed with patch 1-10, as they solve a really
important problem ("slow reboots") in a nice way, whereby patch 11
handles a case that can be worked around comparatively easily by
management tools -- my 2 cents.
how would management tools work around the issue that a shutdown can
take very long?
The traditional approach is to wait with starting a new VM until memory
has been freed up, or to start it on another hypervisor. That raises the
question about the target use case.
What I don't get is that we have to pay the price for freeing up that
memory. Why isn't it sufficient to keep the process running and let
ordinary MM do its thing?
Maybe you should clearly spell out what the target use case for the fast
shutdown (fast quitting of the process?) is? I assume it is starting a
new VM / process / whatsoever on the same host immediately, and then
a) Eventually slowing down other processes due to heavy reclaim.
b) Slowing down the new process because you have to pay the price of
cleaning up memory.
I think I am missing why we need the lazy destroy at all when killing a
process. Couldn't you instead teach the OOM killer "hey, we're currently
quitting a heavy process that is just *very* slow to free up memory,
please wait for that before starting shooting around"?
--
Thanks,
David / dhildenb