Re: [libvirt PATCH v3 5/5] qemu: enable asynchronous teardown on s390x hosts by default

Boris Fiuczynski <fiuczy@xxxxxxxxxxxxx> · Mon, 10 Jul 2023 11:57:34 +0200

On 7/5/23 4:47 PM, Daniel P. Berrangé wrote:
On Wed, Jul 05, 2023 at 04:27:46PM +0200, Boris Fiuczynski wrote:
On 7/5/23 3:08 PM, Daniel P. Berrangé wrote:
On Wed, Jul 05, 2023 at 02:46:03PM +0200, Claudio Imbrenda wrote:
On Wed, 5 Jul 2023 13:26:32 +0100
Daniel P. Berrangé <berrange@xxxxxxxxxx> wrote:

[...]

I rather think mgmt apps need to explicitly opt-in to async teardown,
so they're aware that they need to take account of delayed RAM
availability in their accounting / guest placement logic.

What would you think about enabling it by default only for guests that
are capable of running in Secure Execution mode?

IIUC, that's basically /all/ guests if running on new enough hardware
with prot_virt=1 enabled on the host OS, so will still present challenges
to mgmt apps needing to be aware of this behaviour AFAICS.

I think there is still some fencing? I don't think it's automatic.

IIUC, the following sequence is possible

    1. Start QEMU with -m 500G
        -> QEMU spawns async teardown helper process
    2. Stop QEMU
        -> Async teardown helper process remains running while
           kernel releases RAM
    3. Start QEMU with -m 500G
        -> Fails with ENOMEM
    ...time passes...
    4. Async teardown helper finally terminates
        -> The full original 500G is only now released for use
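
As a rough illustration of what the sequence above means for a mgmt app's
placement logic, here is a minimal sketch. Everything in it is hypothetical:
the function name can_place, the idea of tracking a "pending release" figure
for RAM still held by a teardown helper, and the 600G host size used in the
usage note. It is not something libvirt provides.

```shell
#!/bin/sh
# Hypothetical placement check: RAM that a teardown helper has not yet
# released must be counted as unavailable, otherwise step 3 of the
# sequence fails with ENOMEM.
# Arguments (all in GiB): host total, in use, pending release, request.
can_place() {
    total_gib=$1
    used_gib=$2
    pending_gib=$3
    request_gib=$4
    free_gib=$((total_gib - used_gib - pending_gib))
    # Succeeds (exit status 0) only if the request fits into what is free now.
    [ "$request_gib" -le "$free_gib" ]
}
```

With an assumed 600G host, "can_place 600 0 500 500" fails while the helper
still holds the old guest's 500G, and "can_place 600 0 0 500" succeeds once
the helper has terminated.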

Basically if you can't do

     while true
     do
        virsh start $guest
        virsh stop $guest
     done

then it is a change in libvirt API semantics, and so will require
explicit opt-in from the mgmt app to use this feature.
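
The loop above can be turned into a bounded check that reports the first
cycle that breaks. This is only a sketch: run_cycles, START_CMD and STOP_CMD
are hypothetical names, and the defaults assume "virsh start" and
"virsh destroy" are the commands meant by the informal "virsh stop" above.

```shell
#!/bin/sh
# Hypothetical knobs; override them to test against stubs.
START_CMD=${START_CMD:-virsh start}
STOP_CMD=${STOP_CMD:-virsh destroy}

# Run a fixed number of start/stop cycles for one guest and stop at the
# first failure. Arguments: guest name, number of cycles.
run_cycles() {
    guest=$1
    cycles=$2
    i=1
    while [ "$i" -le "$cycles" ]; do
        # Word splitting of $START_CMD / $STOP_CMD is intentional so that
        # "virsh start" expands to a command plus its subcommand.
        if ! $START_CMD "$guest"; then
            echo "start failed in cycle $i"
            return 1
        fi
        if ! $STOP_CMD "$guest"; then
            echo "stop failed in cycle $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $cycles cycles passed"
}
```

Calling "run_cycles $guest 10" would then be expected to pass every cycle if
the API semantics are unchanged; a start failure in cycle 2 or later would
point at the delayed RAM release described above.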

What is your expectation if libvirt ["virsh stop $guest"] gives up waiting
for qemu to terminate, e.g. when termination takes 20+ minutes? I think that
libvirt does have a timeout when trying to stop qemu and then gives up.
Wouldn't you encounter the same problem that way?

Yes, that would be a bug. We've tried to address these in the past.
For example, when there are PCI host devs assigned, the kernel takes
quite a bit longer to terminate QEMU. In that case, we extended the
timeout we wait for QEMU to exit.

Essentially the idea is that when 'virsh destroy' returns we want the
caller to have a strong guarantee that all resources are released.
IOW, if it sees an error code the expectation is that QEMU has suffered
a serious problem, such as being stuck in an uninterruptible sleep in
kernel space. We don't want the caller to see errors in "normal" scenarios.

With regards,
Daniel

Daniel,
so the idea is to extend the wait until QEMU terminates?
What is your proposal for fixing the bug?

We had a scenario with a 2TB guest running NOT in Secure Execution mode
whose termination resulted in libvirt giving up on terminating the guest
after 40 seconds (10s SIGTERM and 30s SIGKILL), and systemd was able to
"kill" the QEMU process after about 140s.

We could add additional time depending on the guest memory size, BUT with
Secure Execution the timeout would need to be increased by a two-digit
factor. Also, libvirt cannot detect whether the guest is in Secure
Execution mode.
I also assume that timeouts of more than 1h are not acceptable. Wouldn't a
long timeout cause other trouble, like stalling a "virsh list" run in
parallel?
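
To make the arithmetic concrete, here is a minimal sketch of a memory-scaled
timeout. It is not how libvirt computes anything: stop_timeout_secs, the
60 seconds per TiB allowance and the factor of 20 in the usage note are
illustrative assumptions. Only the 40s base (10s SIGTERM + 30s SIGKILL)
comes from the scenario above.

```shell
#!/bin/sh
# Hypothetical timeout calculation.
# Arguments: guest memory in GiB, extra seconds per TiB (assumed default 60),
# Secure Execution multiplier (assumed default 1, i.e. no Secure Execution).
stop_timeout_secs() {
    mem_gib=$1
    secs_per_tib=${2:-60}
    se_factor=${3:-1}
    # 40s base wait plus a memory-proportional allowance (integer arithmetic).
    echo $((40 + mem_gib * secs_per_tib * se_factor / 1024))
}
```

Under these assumptions "stop_timeout_secs 2048" gives 160 seconds for the
2TB guest, while "stop_timeout_secs 2048 60 20" gives 2440 seconds, i.e.
roughly 40 minutes with an assumed two-digit factor of 20.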

--
Kind regards
   Boris Fiuczynski
