On Fri, 6 Aug 2021 09:10:28 +0200
David Hildenbrand <david@xxxxxxxxxx> wrote:

> On 04.08.21 17:40, Claudio Imbrenda wrote:
> > Previously, when a protected VM was rebooted or shut down, its
> > memory was made unprotected, and then the protected VM itself was
> > destroyed. Looping over the whole address space can take some
> > time, considering the overhead of the various Ultravisor Calls
> > (UVCs). This means that a reboot or a shutdown could take a long
> > time, depending on the amount of memory in use.
> >
> > This patch series implements a deferred destroy mechanism for
> > protected guests. When a protected guest is destroyed, its memory
> > is cleared in the background, allowing the guest to restart or
> > terminate significantly faster than before.
> >
> > There are two possibilities when a protected VM is torn down:
> > * it still has an address space associated with it (reboot case)
> > * it no longer has an address space (shutdown case)
> >
> > For the reboot case, the reference count of the mm is increased,
> > and then a background thread is started to clean up. Once the
> > thread has gone through the whole address space, the protected VM
> > is actually destroyed.
>
> That doesn't sound too hacky to me, and actually sounds like a good
> idea, doing what the guest would do either way but speeding it up
> asynchronously, but ...
>
> > For the shutdown case, a list of pages to be destroyed is formed
> > when the mm is torn down. Instead of just being unmapped when the
> > address space is torn down, the pages are also set aside. Later,
> > when KVM cleans up the VM, a thread is started to clean up the
> > pages from the list.
>
> ... this ...
>
> > This means that the same address space can have memory belonging
> > to more than one protected guest, although only one will be
> > running; the others will in fact not even have any CPUs.
>
> ... this ...

this ^ is exactly the reboot case.

> > When a guest is destroyed, its memory still counts towards its
> > memory control group until it is actually freed (I tested this
> > experimentally).
> >
> > When the system runs out of memory, if a guest has terminated and
> > its memory is being cleaned up asynchronously, the OOM killer will
> > wait a little and then see whether memory has been freed. This has
> > the practical effect of slowing down memory allocations when the
> > system is out of memory, giving the cleanup thread time to free
> > memory and avoiding an actual OOM situation.
>
> ... and this sounds like the kind of arch MM hacks that will bite
> us in the long run. Of course, I might be wrong, but already doing
> excessive GFP_ATOMIC allocations or messing with the OOM killer that

they are GFP_ATOMIC, but they should not put too much pressure on
memory and they can also fail without consequences. I used:

GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN

Also notice that after every page allocation a page gets freed, so
this is only temporary.

I would not call it "messing with the OOM killer"; I'm using the same
interface used by virtio-balloon.
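To make that concrete, here is a rough sketch of the shutdown-case
pattern (this is not code from the series: stash_leftover_page() and
every other name below are invented for illustration; only the GFP
flag combination above and the OOM notifier interface, which is what
virtio-balloon registers, come from this discussion):

#include <linux/atomic.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/notifier.h>
#include <linux/oom.h>
#include <linux/slab.h>

struct leftover_page {
	struct list_head list;
	struct page *page;
};

static LIST_HEAD(leftover_pages);	/* pages of already-dead guests */
static atomic_long_t pages_cleaned_up;	/* bumped by the cleanup thread */

/*
 * Called while the mm is being torn down: set the page aside for the
 * cleanup thread instead of destroying it synchronously.
 */
static bool stash_leftover_page(struct page *page)
{
	struct leftover_page *lp;

	/*
	 * Atomic context, but the allocation is allowed to fail:
	 * __GFP_NOMEMALLOC keeps it away from the emergency reserves
	 * and __GFP_NOWARN suppresses the failure warning. On failure
	 * the caller simply destroys the page synchronously. The node
	 * is freed again once the cleanup thread has processed the
	 * page, so the overhead is temporary.
	 */
	lp = kmalloc(sizeof(*lp),
		     GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN);
	if (!lp)
		return false;

	lp->page = page;
	list_add_tail(&lp->list, &leftover_pages);	/* locking elided */
	return true;
}

/*
 * The interface virtio-balloon uses: report how much memory the
 * cleanup thread has freed in the meantime, so the OOM killer
 * re-checks the watermarks before actually killing anything.
 */
static int leftover_oom_notify(struct notifier_block *nb,
			       unsigned long dummy, void *parm)
{
	unsigned long *freed = parm;

	*freed += atomic_long_xchg(&pages_cleaned_up, 0);
	return NOTIFY_OK;
}

static struct notifier_block leftover_oom_nb = {
	.notifier_call = leftover_oom_notify,
};

/* at init time: register_oom_notifier(&leftover_oom_nb); */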
> way for a pure (shutdown) optimization is an alarm signal. Of
> course, I might be wrong.
>
> You should at least CC linux-mm. I'll do that right now and also CC
> Michal. He might have time to have a quick glimpse at patch #11 and
> #13.
>
> https://lkml.kernel.org/r/20210804154046.88552-12-imbrenda@xxxxxxxxxxxxx
> https://lkml.kernel.org/r/20210804154046.88552-14-imbrenda@xxxxxxxxxxxxx
>
> IMHO, we should proceed with patches 1-10, as they solve a really
> important problem ("slow reboots") in a nice way, whereas patch 11
> handles a case that can be worked around comparatively easily by
> management tools -- my 2 cents.

How would management tools work around the issue that a shutdown can
take a very long time?

Also, without my patches, the shutdown case would use export instead
of destroy, making it even slower.
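For readers following the thread, the reboot case described in the
cover letter boils down to roughly the following (again only a sketch
under assumptions: pv_start_deferred_destroy() and friends are
invented names, not the functions from the actual patches):

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/mm.h>
#include <linux/sched/mm.h>
#include <linux/slab.h>

struct deferred_destroy {
	struct mm_struct *mm;	/* address space of the dying guest */
	/* plus a handle identifying the protected VM to destroy */
};

static int pv_destroy_thread(void *data)
{
	struct deferred_destroy *dd = data;

	/*
	 * Walk the whole address space and issue the expensive
	 * per-page Ultravisor calls here, while the rebooted guest is
	 * already running in a freshly created protected VM.
	 */

	mmput(dd->mm);		/* drop the reference taken below */
	kfree(dd);
	return 0;
}

static int pv_start_deferred_destroy(struct mm_struct *mm)
{
	struct deferred_destroy *dd;
	struct task_struct *t;

	dd = kmalloc(sizeof(*dd), GFP_KERNEL);
	if (!dd)
		return -ENOMEM;	/* fall back to synchronous destroy */

	dd->mm = mm;
	mmget(mm);		/* keep the page tables alive for the worker */

	t = kthread_run(pv_destroy_thread, dd, "pv_destroy");
	if (IS_ERR(t)) {
		mmput(mm);
		kfree(dd);
		return PTR_ERR(t);
	}
	return 0;
}

The mmget()/mmput() pair is the "reference count of the mm is
increased" step from the cover letter: it keeps the page tables alive
until the worker has made every page unprotected.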