Re: [PATCH 23/26] rpc: fix shutdown sequence when preserving state

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Feb 03, 2025 at 12:04:01PM +0100, Peter Krempa wrote:
> On Wed, Jan 08, 2025 at 19:42:56 +0000, Daniel P. Berrangé wrote:
> > The preserving of state (ie running VMs) requires a fully functional
> > daemon and hypervisor driver. If any part has started shutting down
> > then saving state may fail, or worse, hang.
> > 
> > The current shutdown sequence does not guarantee safe ordering, as
> > we synchronize with the state saving thread only after the hypervisor
> > driver has had its 'shutdownPrepare' callback invoked. In the case of
> > QEMU this means that worker threads processing monitor events may well
> > have been stopped.
> > 
> > This implements a full state machine that has a well defined ordering
> > that an earlier commit documented as the desired semantics.
> > 
> > With this change, nothing will start shutting down if the state saving
> > thread is still running.
> > 
> > Signed-off-by: Daniel P. Berrangé <berrange@xxxxxxxxxx>
> > ---
> >  src/remote/remote_daemon.c |   4 +-
> >  src/rpc/virnetdaemon.c     | 103 ++++++++++++++++++++++++++-----------
> >  2 files changed, 76 insertions(+), 31 deletions(-)
> > 
> > diff --git a/src/remote/remote_daemon.c b/src/remote/remote_daemon.c
> > index e03ee1de5a..f64ecad7f5 100644
> > --- a/src/remote/remote_daemon.c
> > +++ b/src/remote/remote_daemon.c
> > @@ -617,8 +617,8 @@ static void daemonRunStateInit(void *opaque)
> >                                             NULL,
> >                                             G_DBUS_SIGNAL_FLAGS_NONE,
> >                                             handleSystemMessageFunc,
> > -                                               dmn,
> > -                                               NULL);
> > +                                           dmn,
> > +                                           NULL);
> >  
> >      /* Only now accept clients from network */
> >      virNetDaemonUpdateServices(dmn, true);
> 
> This belongs to a different commit.
> 
> 
> > diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
> > index 7f14c5420a..d1d7bae569 100644
> > --- a/src/rpc/virnetdaemon.c
> > +++ b/src/rpc/virnetdaemon.c
> > @@ -39,6 +39,15 @@
> >  
> >  VIR_LOG_INIT("rpc.netdaemon");
> >  
> > +typedef enum {
> > +    VIR_NET_DAEMON_QUIT_NONE,
> > +    VIR_NET_DAEMON_QUIT_REQUESTED,
> > +    VIR_NET_DAEMON_QUIT_PRESERVING,
> > +    VIR_NET_DAEMON_QUIT_READY,
> > +    VIR_NET_DAEMON_QUIT_WAITING,
> > +    VIR_NET_DAEMON_QUIT_COMPLETED,
> > +} virNetDaemonQuitPhase;
> 
> [...]
> 
> > @@ -753,7 +762,7 @@ virNetDaemonRun(virNetDaemon *dmn)
> >      virSystemdNotifyReady();
> >  
> >      VIR_DEBUG("dmn=%p quit=%d", dmn, dmn->quit);
> > -    while (!dmn->finished) {
> > +    while (dmn->quit != VIR_NET_DAEMON_QUIT_COMPLETED) {
> >          virNetDaemonShutdownTimerUpdate(dmn);
> >  
> >          virObjectUnlock(dmn);
> > @@ -767,17 +776,30 @@ virNetDaemonRun(virNetDaemon *dmn)
> >          virHashForEach(dmn->servers, daemonServerProcessClients, NULL);
> >  
> >          /* don't shutdown services when performing an exec-restart */
> > -        if (dmn->quit && dmn->execRestart)
> > +        if (dmn->quit == VIR_NET_DAEMON_QUIT_REQUESTED && dmn->execRestart)
> >              goto cleanup;
> >  
> > -        if (dmn->quit && dmn->finishTimer == -1) {
> > +        if (dmn->quit == VIR_NET_DAEMON_QUIT_REQUESTED) {
> > +            VIR_DEBUG("Process quit request");
> >              virHashForEach(dmn->servers, daemonServerClose, NULL);
> > +
> > +            if (dmn->shutdownPreserveThread) {
> > +                VIR_DEBUG("Shutdown preserve thread running");
> > +                dmn->quit = VIR_NET_DAEMON_QUIT_PRESERVING;
> > +            } else {
> > +                VIR_DEBUG("Ready to shutdown");
> > +                dmn->quit = VIR_NET_DAEMON_QUIT_READY;
> > +            }
> > +        }
> > +
> > +        if (dmn->quit == VIR_NET_DAEMON_QUIT_READY) {
> > +            VIR_DEBUG("Starting shutdown, running prepare");
> >              if (dmn->shutdownPrepareCb && dmn->shutdownPrepareCb() < 0)
> >                  break;
> >  
> > -            if ((dmn->finishTimer = virEventAddTimeout(30 * 1000,
> > -                                                       virNetDaemonFinishTimer,
> > -                                                       dmn, NULL)) < 0) {
> > +            if ((dmn->quitTimer = virEventAddTimeout(30 * 1000,
> > +                                                     virNetDaemonQuitTimer,
> > +                                                     dmn, NULL)) < 0) {
> >                  VIR_WARN("Failed to register finish timer.");
> >                  break;
> >              }
> > @@ -787,6 +809,9 @@ virNetDaemonRun(virNetDaemon *dmn)
> >                  VIR_WARN("Failed to register join thread.");
> >                  break;
> >              }
> > +
> > +            VIR_DEBUG("Waiting for shutdown completion");
> > +            dmn->quit = VIR_NET_DAEMON_QUIT_WAITING;
> 
> [1]
> 
> 
> > @@ -808,7 +833,7 @@ virNetDaemonQuit(virNetDaemon *dmn)
> >      VIR_LOCK_GUARD lock = virObjectLockGuard(dmn);
> >  
> >      VIR_DEBUG("Quit requested %p", dmn);
> > -    dmn->quit = true;
> > +    dmn->quit = VIR_NET_DAEMON_QUIT_REQUESTED;
> >  }
> >  
> >  
> > @@ -818,7 +843,7 @@ virNetDaemonQuitExecRestart(virNetDaemon *dmn)
> >      VIR_LOCK_GUARD lock = virObjectLockGuard(dmn);
> >  
> >      VIR_DEBUG("Exec-restart requested %p", dmn);
> > -    dmn->quit = true;
> > +    dmn->quit = VIR_NET_DAEMON_QUIT_REQUESTED;
> >      dmn->execRestart = true;
> >  }
> 
> 
> These two seem to be apart from other cases also be registered in
> various signal (SIGTERM/SIGINT/etc) handlers. As the signal handler code
> we use doesn't seem to interlock multiple deliveries of said signals it
> seems that these will be able to mess up the state of the state machine
> if a 'well-timed' signal hits. Specifically they could clear
> VIR_NET_DAEMON_QUIT_PRESERVING or VIR_NET_DAEMON_QUIT_WAITING.
> 
> When VIR_NET_DAEMON_QUIT_PRESERVING will be overwritten it will get
> fixed in the next iteration of the worker
> loop in virNetDaemonRun(), but will re-notify systemd about the shutdown
> and re-stop the 'server's.
> 
> What's most likely worse is that [1] VIR_NET_DAEMON_QUIT_WAITING can
> be lost as well, which would re-register the timer and start another
> shutdown thread.

We can't race because signals are queued and dispatched from the
main event thread. We must, however, check that
quit == VIR_NET_DAEMON_QUIT_NONE before setting it to
VIR_NET_DAEMON_QUIT_REQUESTED.

> 
> >  
> > @@ -827,12 +852,21 @@ static void virNetDaemonPreserveWorker(void *opaque)
> >  {
> >      virNetDaemon *dmn = opaque;
> >  
> > -    VIR_DEBUG("Begin stop dmn=%p", dmn);
> > +    VIR_DEBUG("Begin preserve dmn=%p", dmn);
> >  
> >      dmn->shutdownPreserveCb();
> >  
> > -    VIR_DEBUG("Completed stop dmn=%p", dmn);
> > +    VIR_DEBUG("Completed preserve dmn=%p", dmn);
> >  
> > +    VIR_WITH_OBJECT_LOCK_GUARD(dmn) {
> > +        if (dmn->quit == VIR_NET_DAEMON_QUIT_PRESERVING) {
> > +            VIR_DEBUG("Marking shutdown as ready");
> > +            dmn->quit = VIR_NET_DAEMON_QUIT_READY;
> > +        }
> > +        g_clear_pointer(&dmn->shutdownPreserveThread, g_free);
> > +    }
> > +
> > +    VIR_DEBUG("End preserve dmn=%p", dmn);
> >      virObjectUnref(dmn);
> >  }
> >  
> > @@ -840,15 +874,26 @@ static void virNetDaemonPreserveWorker(void *opaque)
> >  void virNetDaemonPreserve(virNetDaemon *dmn)
> >  {
> >      VIR_LOCK_GUARD lock = virObjectLockGuard(dmn);
> > +    VIR_DEBUG("Preserve state request");
> >  
> > -    if (!dmn->shutdownPreserveCb ||
> > -        dmn->shutdownPreserveThread)
> > +    if (!dmn->shutdownPreserveCb) {
> > +        VIR_DEBUG("No preserve callback registered");
> >          return;
> > +    }
> > +    if (dmn->shutdownPreserveThread) {
> > +        VIR_DEBUG("Preserve state thread already running");
> > +        return;
> > +    }
> > +
> > +    if (dmn->quit != VIR_NET_DAEMON_QUIT_NONE) {
> > +        VIR_WARN("Already initiated shutdown sequence, unable to preserve state");
> > +        return;
> > +    }
> >  
> >      virObjectRef(dmn);
> >      dmn->shutdownPreserveThread = g_new0(virThread, 1);
> >  
> > -    if (virThreadCreateFull(dmn->shutdownPreserveThread, true,
> > +    if (virThreadCreateFull(dmn->shutdownPreserveThread, false,
> 
> This looks like a bugfix to a previous commit as the thread was never
> actually joined.

The previous code would join with the shutdownPreserveThread
in order to synchronize the shutdown sequence. This logic
was broken, and in this patch we're making the thread itself
update 'dmn->quit' when it is done, so it can exit without
anything needing to join to it. All synchronization is now via
the 'dmn->quit' state machine. Thus we toggle to mark this thread
as non-joinable in this patch

> 
> >                              virNetDaemonPreserveWorker,
> >                              "daemon-stop", false, dmn) < 0) {
> >          virObjectUnref(dmn);
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




[Index of Archives]     [Virt Tools]     [Libvirt Users]     [Lib OS Info]     [Fedora Users]     [Fedora Desktop]     [Fedora SELinux]     [Big List of Linux Books]     [Yosemite News]     [KDE Users]     [Fedora Tools]

  Powered by Linux