Re: [PATCH] libvirt: add daemon itself as shutdown reason

Nikolay Shirokovskiy <nshirokovskiy@xxxxxxxxxxxxx> · Wed, 17 Oct 2018 08:16:38 +0000

On 16.10.2018 15:48, John Ferlan wrote:
> 
> 
> On 10/8/18 7:21 AM, Nikolay Shirokovskiy wrote:
>> Let's introduce shutdown reason "daemon" which means we have to
>> kill running domain ourselves as the best action we can take at
>> that moment. Failure to pick up domain on daemon restart is
>> one example of such case. Using reason "crashed" is a bit misleading
>> as it is used when qemu is actually crashed or qemu destroyed on
>> guest panic as result of user choice that is the problem was
>> in qemu/guest itself. So I propose to use "crashed" only for
>> qemu side issues and introduce "daemon" when we have to kill the qemu
>> for good.
> 
> How about (although reading below may shed some more context):
> 
> This patch introduces a new shutdown reason "daemon" in order
> to indicate that the daemon needed to force shutdown the domain
> as the best course of action to take at the moment.
> 
> This action would occur during Reconnect processing when it's
> determined that either the QEMU monitor no longer exists for
> some unknown reason or creation of a thread to reconnect the
> domain failed.

Mostly I'm fine with this one except "monitor no longer exists"
is condition for reason "crashed" or "unknown" (Martin's contribution)

> 
>>
>> There is another example where "daemon" will be useful. If we can
>> not reboot domain we kill it and got "crashed" reason now. Looks
>> like good candidate for "daemon" reason.
> 
> If you feel this way, then a followup patch could/should be posted;
> otherwise, this'll be lost to commit message heaven where all good ideas
> go to die ;-).

Sure!

> 
>>
>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@xxxxxxxxxxxxx>
>> ---
>>  include/libvirt/libvirt-domain.h |  1 +
>>  src/conf/domain_conf.c           |  3 ++-
>>  src/qemu/qemu_process.c          | 11 ++++-------
>>  tools/virsh-domain-monitor.c     |  3 ++-
>>  4 files changed, 9 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
>> index fdd2d6b..11fdab5 100644
>> --- a/include/libvirt/libvirt-domain.h
>> +++ b/include/libvirt/libvirt-domain.h
>> @@ -145,6 +145,7 @@ typedef enum {
>>      VIR_DOMAIN_SHUTOFF_FAILED = 6,      /* domain failed to start */
>>      VIR_DOMAIN_SHUTOFF_FROM_SNAPSHOT = 7, /* restored from a snapshot which was
>>                                             * taken while domain was shutoff */
>> +    VIR_DOMAIN_SHUTOFF_DAEMON = 8,      /* daemon have to kill domain */
> 
> s/have to/decides to/
> 
>>  # ifdef VIR_ENUM_SENTINELS
>>      VIR_DOMAIN_SHUTOFF_LAST
>>  # endif
>> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
>> index 9911d56..e441723 100644
>> --- a/src/conf/domain_conf.c
>> +++ b/src/conf/domain_conf.c
>> @@ -779,7 +779,8 @@ VIR_ENUM_IMPL(virDomainShutoffReason, VIR_DOMAIN_SHUTOFF_LAST,
>>                "migrated",
>>                "saved",
>>                "failed",
>> -              "from snapshot")
>> +              "from snapshot",
>> +              "daemon")
>>  
>>  VIR_ENUM_IMPL(virDomainCrashedReason, VIR_DOMAIN_CRASHED_LAST,
>>                "unknown",
>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>> index e9c7618..c4bc9ca 100644
>> --- a/src/qemu/qemu_process.c
>> +++ b/src/qemu/qemu_process.c
>> @@ -7988,15 +7988,12 @@ qemuProcessReconnect(void *opaque)
>>          /* We can't get the monitor back, so must kill the VM
>>           * to remove danger of it ending up running twice if
>>           * user tries to start it again later
> 
> Since we're changing anyway, let's put the stop there, e.g.
> "s/later/later./"
> 
>> -         * If we couldn't get the monitor since QEMU supports
>> -         * no-shutdown, we can safely say that the domain
>> -         * crashed ... */
>> -        state = VIR_DOMAIN_SHUTOFF_CRASHED;
>> -        /* If BeginJob failed, we jumped here without a job, let's hope another
>> +         * If BeginJob failed, we jumped here without a job, let's hope another
>>           * thread didn't have a chance to start playing with the domain yet
>>           * (it's all we can do anyway).
>>           */
>> -        qemuProcessStop(driver, obj, state, QEMU_ASYNC_JOB_NONE, stopFlags);
>> +        qemuProcessStop(driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>> +                        QEMU_ASYNC_JOB_NONE, stopFlags);
>>      }
>>      goto cleanup;
>>  }
>> @@ -8035,7 +8032,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
>>           * is no thread that could be doing anything else with the same domain
>>           * object.
>>           */
>> -        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
>> +        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>>                          QEMU_ASYNC_JOB_NONE, 0);
>>          qemuDomainRemoveInactiveJobLocked(src->driver, obj);
>>  
> 
> When qemuProcessReconnectHelper was introduced (commit d38897a5d)
> reconnection failure used VIR_DOMAIN_SHUTOFF_FAILED; however, that was
> changed in commit bda2f17d to either VIR_DOMAIN_SHUTOFF_CRASHED or
> VIR_DOMAIN_SHUTOFF_UNKNOWN. When QEMU_CAPS_NO_SHUTDOWN checking was
> removed in commit fe35b1ad6 the conditional state was just left at
> VIR_DOMAIN_SHUTOFF_CRASHED.
> 
> Still I agree w/ Martin's thoughts, which unfortunately wasn't commented
> on in the code leading to the latter revert commit losing that unknown
> reason. The reason for the "-no-shutdown" *did* change and that probably
> should have been reflected in the qemuProcessReconnect logic.
> 
> So I think perhaps we should have an initial patch which references or
> uses that first paragraph in order to change the code to:
> 
>         /* We cannot get the monitor back, so kill the VM to
>          * remove possibility of it ending up running twice if
>          * user tries to start it again later.
>          *
>          * If we cannot get to the monitor when the QEMU command
>          * line used -no-shutdown, then we can safely say that the
>          * domain crashed; otherwise we don't really know. */
>         if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES)
>             state = VIR_DOMAIN_SHUTOFF_CRASHED;
>         else
>             state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
> 
> and uses the short commit msg of "qemu: Restore lost shutdown reason".
> 
> Then this followup would change the "UNKNOWN" to "DAEMON" keeping the
> CRASHED otherwise.
> 

Then we will get "crashed" reason always for modern qemu and domains
that can reboot which is not true.

However I see now that the issue is more complicated then I thought. I agree
with Martin too but we have to additionally check that connect(2) for domain
monitor socket was called and error was ECONNREFUSED. So finally state is
resolved like this:

if (connect(2) was not called) {
    state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
else if (connect(2) result is NOT ECONNREFUSED) {
    state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
} else if (connect(2) result is ECONNREFUSED) {
    if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES)
        state = VIR_DOMAIN_SHUTOFF_CRASHED;
    else
        state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
} else {
    state = VIR_DOMAIN_SHUTOFF_DAEMON;
}

Moreover we should not later kill process by remembered pid in all cases except
last one in the above code because we can kill some other process.

Looks like a bit complicated, not sure should we pursuit this one or just left
VIR_DOMAIN_SHUTOFF_UNKNOWN only (in this case we can introduce
VIR_DOMAIN_SHUTOFF_DAEMON for reboot failures where reason is obvisous).

Nikolay

> 
>> diff --git a/tools/virsh-domain-monitor.c b/tools/virsh-domain-monitor.c
>> index 3a26363..f0ad558 100644
>> --- a/tools/virsh-domain-monitor.c
>> +++ b/tools/virsh-domain-monitor.c
>> @@ -212,7 +212,8 @@ VIR_ENUM_IMPL(virshDomainShutoffReason,
>>                N_("migrated"),
>>                N_("saved"),
>>                N_("failed"),
>> -              N_("from snapshot"))
>> +              N_("from snapshot"),
>> +              N_("daemon"))
>>  
>>  VIR_ENUM_DECL(virshDomainCrashedReason)
>>  VIR_ENUM_IMPL(virshDomainCrashedReason,
>>

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list