On Tue, 18 May 2021 18:20:22 +0200
Christian Borntraeger <borntraeger@xxxxxxxxxx> wrote:

> On 18.05.21 18:13, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 17:45:18 +0200
> > Christian Borntraeger <borntraeger@xxxxxxxxxx> wrote:
> >
> >> On 18.05.21 17:36, Claudio Imbrenda wrote:
> >>> On Tue, 18 May 2021 17:05:37 +0200
> >>> Cornelia Huck <cohuck@xxxxxxxxxx> wrote:
> >>>
> >>>> On Mon, 17 May 2021 22:07:47 +0200
> >>>> Claudio Imbrenda <imbrenda@xxxxxxxxxxxxx> wrote:
> >>>>
> >>>>> Previously, when a protected VM was rebooted or when it was shut
> >>>>> down, its memory was made unprotected, and then the protected VM
> >>>>> itself was destroyed. Looping over the whole address space can
> >>>>> take some time, considering the overhead of the various
> >>>>> Ultravisor Calls (UVCs). This means that a reboot or a shutdown
> >>>>> could take a long time, depending on the amount of used memory.
> >>>>>
> >>>>> This patch series implements a deferred destroy mechanism for
> >>>>> protected guests. When a protected guest is destroyed, its
> >>>>> memory is cleared in the background, allowing the guest to
> >>>>> restart or terminate significantly faster than before.
> >>>>>
> >>>>> There are 2 possibilities when a protected VM is torn down:
> >>>>> * it still has an address space associated (reboot case)
> >>>>> * it does not have an address space anymore (shutdown case)
> >>>>>
> >>>>> For the reboot case, the reference count of the mm is increased,
> >>>>> and then a background thread is started to clean up. Once the
> >>>>> thread has gone through the whole address space, the protected
> >>>>> VM is actually destroyed.
> >>>>>
> >>>>> For the shutdown case, a list of pages to be destroyed is formed
> >>>>> when the mm is torn down. Instead of just unmapping the pages
> >>>>> when the address space is being torn down, they are also set
> >>>>> aside. Later, when KVM cleans up the VM, a thread is started to
> >>>>> clean up the pages from the list.
> >>>>
> >>>> Just to make sure, 'clean up' includes doing uv calls?
> >>>
> >>> yes
> >>>
> >>>>>
> >>>>> This means that the same address space can have memory belonging
> >>>>> to more than one protected guest, although only one will be
> >>>>> running; the others will in fact not even have any CPUs.
> >>>>
> >>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> >>>> accessible in any way? I would assume that they only belong to
> >>>> the
> >>>
> >>> in case of reboot: yes, they are still in the address space of the
> >>> guest, and can be swapped if needed
> >>>
> >>>> 'zombie' guests, and any new or rebooted guest is a new entity
> >>>> that needs to get new pages?
> >>>
> >>> the rebooted guest (normal or secure) will re-use the same pages
> >>> of the old guest (before or after cleanup, which is the reason for
> >>> patches 3 and 4)
> >>>
> >>> the KVM guest is not affected in case of reboot, so the userspace
> >>> address space is not touched.
> >>>
> >>>> Can too many not-yet-cleaned-up pages lead to a (temporary)
> >>>> memory exhaustion?
> >>>
> >>> in case of reboot, not much; the pages that were in use are still
> >>> in use after the reboot, and they can be swapped.
> >>>
> >>> in case of a shutdown, yes, because the pages are really taken
> >>> aside and cleared/destroyed in the background. they cannot be
> >>> swapped. they are freed immediately as they are processed, to try
> >>> to mitigate memory exhaustion scenarios.
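(As an aside, a minimal sketch of the "freed immediately as they are
processed" behaviour described above, in the shutdown case. This is not
the code from the patches; the names leftover_pages and
uv_destroy_page() are assumptions made only for illustration.)

/*
 * Illustrative sketch only: walk the list of pages that were set aside
 * while the mm was torn down, destroy each one via an Ultravisor call,
 * and hand it back to the allocator right away, so the extra memory
 * footprint shrinks as the cleanup thread makes progress.
 */
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <asm/uv.h>	/* assumed home of the UV call helpers on s390 */

static void pv_cleanup_leftover_pages(struct list_head *leftover_pages)
{
	struct page *page, *tmp;

	list_for_each_entry_safe(page, tmp, leftover_pages, lru) {
		list_del(&page->lru);
		/* make the page non-secure again (UVC); helper name assumed */
		uv_destroy_page(page_to_phys(page));
		/* free it immediately to limit the memory overhead */
		__free_page(page);
		cond_resched();
	}
}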
> >>>
> >>> in the end, this patch series is a tradeoff between speed and
> >>> memory consumption. the memory needs to be cleared up at some
> >>> point, and that requires time.
> >>>
> >>> in cases where this might be an issue, I introduced a new KVM flag
> >>> to disable lazy destroy (patch 10)
> >>
> >> Maybe we could piggy-back on the OOM-kill notifier and then fall
> >> back to synchronous freeing for some pages?
> >
> > I'm not sure I follow
> >
> > once the pages have been set aside, it's too late
> >
> > while the pages are being set aside, every now and then some memory
> > needs to be allocated. the allocation is atomic, not allowed to use
> > emergency reserves, and can fail without warning. if the allocation
> > fails, we clean up one page and continue, without setting aside
> > anything (patch 9)
> >
> > so if the system is low on memory, the lazy destroy should not make
> > the situation too much worse.
> >
> > the only issue here is starting a normal process in the host (maybe
> > a non-secure guest) that uses a lot of memory very quickly, right
> > after a large secure guest has terminated.
>
> I think page cache page allocations do not need to be atomic.
> In that case the kernel might still decide to trigger the OOM
> killer. We can let it notify us, free 256 pages synchronously
> and avoid the OOM kill. Have a look at the virtio-balloon
> virtio_balloon_oom_notify.

the issue is that once the pages have been set aside, it's too late.
the OOM notifier would only be useful if we got notified of the OOM
situation _while_ setting aside the pages.

unless you mean that the notifier should simply wait until the thread
has done (some of) its work?
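For reference, here is a minimal sketch of the virtio_balloon_oom_notify-
style idea being discussed: an OOM notifier that synchronously destroys
and frees a bounded batch of the set-aside pages. The list, lock, and
helper names below are made up for illustration and are not part of the
posted series.

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/notifier.h>
#include <linux/oom.h>
#include <linux/spinlock.h>
#include <asm/uv.h>	/* assumed home of the UV call helpers on s390 */

#define PV_OOM_FREE_BATCH 256

/* hypothetical list of set-aside pages, filled while the mm is torn down */
static LIST_HEAD(pv_leftover_pages);
static DEFINE_SPINLOCK(pv_leftover_lock);

static int pv_oom_notify(struct notifier_block *nb, unsigned long dummy,
			 void *parm)
{
	unsigned long *freed = parm;
	unsigned int i;

	/* synchronously destroy and free up to PV_OOM_FREE_BATCH pages */
	for (i = 0; i < PV_OOM_FREE_BATCH; i++) {
		struct page *page;

		spin_lock(&pv_leftover_lock);
		page = list_first_entry_or_null(&pv_leftover_pages,
						struct page, lru);
		if (page)
			list_del(&page->lru);
		spin_unlock(&pv_leftover_lock);

		if (!page)
			break;

		uv_destroy_page(page_to_phys(page));	/* assumed helper */
		__free_page(page);
		(*freed)++;
	}

	return NOTIFY_OK;
}

static struct notifier_block pv_oom_nb = {
	.notifier_call = pv_oom_notify,
};

/* at init time: register_oom_notifier(&pv_oom_nb); */

As noted above, such a notifier could only help with pages that have
already been set aside; it would not speed up (or run during) the
set-aside phase itself.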