On Fri, 6 Aug 2021 09:10:28 +0200
David Hildenbrand <david@xxxxxxxxxx> wrote:

> On 04.08.21 17:40, Claudio Imbrenda wrote:
> > Previously, when a protected VM was rebooted or shut down, its
> > memory was made unprotected, and then the protected VM itself was
> > destroyed. Looping over the whole address space can take some
> > time, considering the overhead of the various Ultravisor Calls
> > (UVCs). This means that a reboot or a shutdown could take a long
> > time, depending on the amount of memory in use.
> >
> > This patch series implements a deferred destroy mechanism for
> > protected guests. When a protected guest is destroyed, its memory
> > is cleared in the background, allowing the guest to restart or
> > terminate significantly faster than before.
> >
> > There are two possibilities when a protected VM is torn down:
> > * it still has an address space associated with it (reboot case)
> > * it no longer has an address space (shutdown case)
> >
> > For the reboot case, the reference count of the mm is increased,
> > and then a background thread is started to clean up. Once the
> > thread has gone through the whole address space, the protected VM
> > is actually destroyed.
>
> That doesn't sound too hacky to me, and actually sounds like a good
> idea, doing what the guest would do either way but speeding it up
> asynchronously, but ...
>
> > For the shutdown case, a list of pages to be destroyed is formed
> > when the mm is torn down. Instead of just being unmapped when the
> > address space is torn down, the pages are also set aside. Later,
> > when KVM cleans up the VM, a thread is started to clean up the
> > pages from the list.
>
> ... this ...
>
> > This means that the same address space can have memory belonging
> > to more than one protected guest, although only one will be
> > running; the others will in fact not even have any CPUs.
>
> ... this ...

this ^ is exactly the reboot case.

> > When a guest is destroyed, its memory still counts towards its
> > memory control group until it is actually freed (I tested this
> > experimentally).
> >
> > When the system runs out of memory, if a guest has terminated and
> > its memory is being cleaned up asynchronously, the OOM killer will
> > wait a little and then see whether memory has been freed. This has
> > the practical effect of slowing down memory allocations when the
> > system is out of memory, giving the cleanup thread time to free
> > memory and avoiding an actual OOM situation.
>
> ... and this sounds like the kind of arch MM hacks that will bite
> us in the long run. Of course, I might be wrong, but already doing
> excessive GFP_ATOMIC allocations or messing with the OOM killer that

they are GFP_ATOMIC, but they should not put too much pressure on
memory and they can also fail without consequences. I used:

GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN

Also notice that after every page allocation a page gets freed, so
this is only temporary.

I would not call it "messing with the OOM killer"; I'm using the same
interface used by virtio-balloon.
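To make that concrete, here is a rough sketch of the shutdown-case
pattern (this is not code from the series: stash_leftover_page() and
every other name below are invented for illustration; only the GFP
flag combination above and the OOM notifier interface, which is what
virtio-balloon registers, come from this discussion):

#include <linux/atomic.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/notifier.h>
#include <linux/oom.h>
#include <linux/slab.h>

struct leftover_page {
	struct list_head list;
	struct page *page;
};

static LIST_HEAD(leftover_pages);	/* pages of already-dead guests */
static atomic_long_t pages_cleaned_up;	/* bumped by the cleanup thread */

/*
 * Called while the mm is being torn down: set the page aside for the
 * cleanup thread instead of destroying it synchronously.
 */
static bool stash_leftover_page(struct page *page)
{
	struct leftover_page *lp;

	/*
	 * Atomic context, but the allocation is allowed to fail:
	 * __GFP_NOMEMALLOC keeps it away from the emergency reserves
	 * and __GFP_NOWARN suppresses the failure warning. On failure
	 * the caller simply destroys the page synchronously. The node
	 * is freed again once the cleanup thread has processed the
	 * page, so the overhead is temporary.
	 */
	lp = kmalloc(sizeof(*lp),
		     GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN);
	if (!lp)
		return false;

	lp->page = page;
	list_add_tail(&lp->list, &leftover_pages);	/* locking elided */
	return true;
}

/*
 * The interface virtio-balloon uses: report how much memory the
 * cleanup thread has freed in the meantime, so the OOM killer
 * re-checks the watermarks before actually killing anything.
 */
static int leftover_oom_notify(struct notifier_block *nb,
			       unsigned long dummy, void *parm)
{
	unsigned long *freed = parm;

	*freed += atomic_long_xchg(&pages_cleaned_up, 0);
	return NOTIFY_OK;
}

static struct notifier_block leftover_oom_nb = {
	.notifier_call = leftover_oom_notify,
};

/* at init time: register_oom_notifier(&leftover_oom_nb); */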
> way for a pure (shutdown) optimization is an alarm signal. Of
> course, I might be wrong.
>
> You should at least CC linux-mm. I'll do that right now and also CC
> Michal. He might have time to have a quick glimpse at patch #11 and
> #13.
>
> https://lkml.kernel.org/r/20210804154046.88552-12-imbrenda@xxxxxxxxxxxxx
> https://lkml.kernel.org/r/20210804154046.88552-14-imbrenda@xxxxxxxxxxxxx
>
> IMHO, we should proceed with patches 1-10, as they solve a really
> important problem ("slow reboots") in a nice way, whereas patch 11
> handles a case that can be worked around comparatively easily by
> management tools -- my 2 cents.

How would management tools work around the issue that a shutdown can
take a very long time?

Also, without my patches, the shutdown case would use export instead
of destroy, making it even slower.
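For readers following the thread, the reboot case described in the
cover letter boils down to roughly the following (again only a sketch
under assumptions: pv_start_deferred_destroy() and friends are
invented names, not the functions from the actual patches):

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/mm.h>
#include <linux/sched/mm.h>
#include <linux/slab.h>

struct deferred_destroy {
	struct mm_struct *mm;	/* address space of the dying guest */
	/* plus a handle identifying the protected VM to destroy */
};

static int pv_destroy_thread(void *data)
{
	struct deferred_destroy *dd = data;

	/*
	 * Walk the whole address space and issue the expensive
	 * per-page Ultravisor calls here, while the rebooted guest is
	 * already running in a freshly created protected VM.
	 */

	mmput(dd->mm);		/* drop the reference taken below */
	kfree(dd);
	return 0;
}

static int pv_start_deferred_destroy(struct mm_struct *mm)
{
	struct deferred_destroy *dd;
	struct task_struct *t;

	dd = kmalloc(sizeof(*dd), GFP_KERNEL);
	if (!dd)
		return -ENOMEM;	/* fall back to synchronous destroy */

	dd->mm = mm;
	mmget(mm);		/* keep the page tables alive for the worker */

	t = kthread_run(pv_destroy_thread, dd, "pv_destroy");
	if (IS_ERR(t)) {
		mmput(mm);
		kfree(dd);
		return PTR_ERR(t);
	}
	return 0;
}

The mmget()/mmput() pair is the "reference count of the mm is
increased" step from the cover letter: it keeps the page tables alive
until the worker has made every page unprotected.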