Re: [RFC] Add API to change qemu agent response timeout

Michal Privoznik <mprivozn@xxxxxxxxxx> · Fri, 4 Oct 2019 11:48:08 +0200

On 10/4/19 8:14 AM, Peter Krempa wrote:
On Thu, Oct 03, 2019 at 16:52:12 -0500, Jonathon Jongsma wrote:
Some layered products such as oVirt have requested a way to avoid being
blocked by guest agent commands when querying a loaded vm. For example,
many guest agent commands are polled periodically to monitor changes,
and rather than blocking the calling process, they'd prefer to simply
time out when an agent query is taking too long.

This patch adds a way for the user to specify a custom agent timeout
that is applied to all agent commands.

One special case to note here is the 'guest-sync' command. 'guest-sync'
is issued internally prior to calling any other command. (For example,
when libvirt wants to call 'guest-get-fsinfo', we first call
'guest-sync' and then call 'guest-get-fsinfo').

Previously, the 'guest-sync' command used a 5-second timeout
(VIR_DOMAIN_QEMU_AGENT_COMMAND_DEFAULT), whereas the actual command that
followed always blocked indefinitely
(VIR_DOMAIN_QEMU_AGENT_COMMAND_BLOCK). As part of this patch, if a
custom timeout is specified that is shorter than
5 seconds,  this new timeout also used for 'guest-sync'. If there is no
custom timeout or if the custom timeout is longer than 5 seconds, we
will continue to use the 5-second timeout.

See https://bugzilla.redhat.com/show_bug.cgi?id=1705426 for additional details.

Signed-off-by: Jonathon Jongsma <jjongsma@xxxxxxxxxx>
---
  include/libvirt/libvirt-qemu.h |  2 +
  src/driver-hypervisor.h        |  5 +++
  src/libvirt-qemu.c             | 40 ++++++++++++++++++++
  src/libvirt_qemu.syms          |  4 ++
  src/qemu/qemu_agent.c          | 69 +++++++++++++++++++---------------
  src/qemu/qemu_agent.h          |  3 ++
  src/qemu/qemu_driver.c         | 24 ++++++++++++
  src/qemu_protocol-structs      |  8 ++++
  src/remote/qemu_protocol.x     | 18 ++++++++-
  src/remote/remote_driver.c     |  1 +
  10 files changed, 143 insertions(+), 31 deletions(-)

[...]

@@ -2737,3 +2730,19 @@ qemuAgentGetTimezone(qemuAgentPtr mon,
  
      return 0;
  }
+
+int
+qemuAgentSetTimeout(qemuAgentPtr mon,
+                    int timeout)
+{
+    if (timeout < VIR_DOMAIN_QEMU_AGENT_COMMAND_MIN) {
+        virReportError(VIR_ERR_INVALID_ARG,
+                       _("guest agent timeout '%d' is "
+                         "less than the minimum '%d'"),
+                       timeout, VIR_DOMAIN_QEMU_AGENT_COMMAND_MIN);

This error is misleading as -1 and -2 are special values and not actual
timeout.

+        return -1;
+    }
+
+    mon->timeout = timeout;
+    return 0;
+}

[...]

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 1e041a8bac..09251cc9e2 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -23434,6 +23434,29 @@ qemuDomainGetGuestInfo(virDomainPtr dom,
      return ret;
  }
  
+static int
+qemuDomainQemuAgentSetTimeout(virDomainPtr dom,
+                              int timeout)
+{
+    virDomainObjPtr vm = NULL;
+    qemuAgentPtr agent;
+    int ret = -1;
+
+    if (!(vm = qemuDomObjFromDomain(dom)))
+        goto cleanup;
+
+    if (virDomainQemuAgentSetTimeoutEnsureACL(dom->conn, vm->def) < 0)
+        goto cleanup;
+
+    agent = qemuDomainObjEnterAgent(vm);

You must acquire a job on @vm if you want to call this:

/*
  * obj must be locked before calling
  *
  * To be called immediately before any QEMU agent API call.
  * Must have already called qemuDomainObjBeginAgentJob() or
  * qemuDomainObjBeginJobWithAgent() and checked that the VM is
  * still active.
  *
  * To be followed with qemuDomainObjExitAgent() once complete
  */
qemuAgentPtr
qemuDomainObjEnterAgent(virDomainObjPtr obj)


+    ret = qemuAgentSetTimeout(agent, timeout);
+    qemuDomainObjExitAgent(vm, agent);

Also this API is inherently racy if you have two clients setting the
timeout and it will influence calls of a different client.

How is this different to one client calling virDomainCreate() whilst the 
other calls virDomainDestroy() over the same domain? I believe our APIs 
are build on assumption that management layer calls only those APIs and 
in the right sequence so that desired state of things is achieved. If 
we'd have two management apps competing over the same domain then agent 
timeout is one of the smaller problems IMO.


IMO the only reasonable approach is to add new APIs which have a
'timeout' parameter for any agent API which requires tunable timeout to
prevent any races and unexpected behaviour.

Thing is, we don't tell users which APIs will talk to monitor and which 
will talk to the agent. For instance, qemuDomainShutdownFlags() prefers 
agent, but it wasn't always like that. And even if we did tell users 
that, we would need to document that @timeout applies to guest agent 
only and might not be honoured if API decides to talk to monitor.


Other possibility may be to add a qemu config file option for this but
that is not really dynamic.

I'd rather see this patch in than yet another qemu.conf knob.

However, reading the linked bug, it looks like to fix the problem 
described there a much simpler solution is viable.

Michal

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list