Re: [PATCH REBASE 4/5] qemu: fix domain object wait to handle monitor errors

Nikolay Shirokovskiy <nshirokovskiy@xxxxxxxxxxxxx> · Mon, 7 May 2018 11:43:18 +0300

On 04.05.2018 17:52, John Ferlan wrote:
> 
> 
> On 05/03/2018 03:54 AM, Nikolay Shirokovskiy wrote:
>>
>>
>> On 01.05.2018 01:03, John Ferlan wrote:
>>>
>>>
>>> On 04/18/2018 10:44 AM, Nikolay Shirokovskiy wrote:
>>>> Block job abort operation can not handle properly qemu crashes
>>>> when waiting for abort/pivot completion. Deadlock scenario is next:
>>>>
>>>> - qemuDomainBlockJobAbort waits for pivot/abort completion
>>>> - qemu crashes, then qemuProcessBeginStopJob broadcasts for VM condition and
>>>>   then waits for job condition (taken by qemuDomainBlockJobAbort)
>>>> - qemuDomainBlockJobAbort awakes but nothing really changed, VM is still
>>>>   active (vm->def->id != -1) so thread starts waiting for completion again.
>>>>   Now two threads are in deadlock.
>>>>
>>>> First let's add some condition besides domain active status so that waiting
>>>> thread can check it when awakes. Second let's move signalling to the place
>>>> where condition is set, namely monitor eof/error handlers. Having signalling
>>>> in qemuProcessBeginStopJob is not useful.
>>>>
>>>> The patch copies monitor error to domain state because at time
>>>> waiting thread awakes there can be no monitor and it is useful to
>>>> report monitor error to user.
>>>>
>>>> The patch has a drawback that on destroying a domain when waiting for
>>>> event from monitor we get not very convinient error message like
>>>
>>> convenient
>>>
>>>> 'EOF from monitor' from waiting API. On the other hand if qemu crashes
>>>> this is more useful then 'domain is not running'. The first case
>>>> will be addressed in another patch.
>>>>
>>>> The patch also fixes other places where we wait for event from qemu.
>>>> Namely handling device removal and tray ejection. Other places which
>>>> used virDomainObjWait (dump, migration, save) were good because of
>>>> async jobs which allow concurrent destroy job.
>>>>
>>>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@xxxxxxxxxxxxx>
>>>> ---
>>>>  src/conf/domain_conf.c    | 43 -------------------------------------------
>>>>  src/conf/domain_conf.h    |  3 ---
>>>>  src/libvirt_private.syms  |  2 --
>>>>  src/qemu/qemu_domain.c    | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>>>  src/qemu/qemu_domain.h    |  5 ++++-
>>>>  src/qemu/qemu_driver.c    |  6 +++---
>>>>  src/qemu/qemu_hotplug.c   |  4 ++--
>>>>  src/qemu/qemu_migration.c | 12 ++++++------
>>>>  src/qemu/qemu_process.c   | 27 ++++++++++++++++++++++-----
>>>>  9 files changed, 82 insertions(+), 65 deletions(-)
>>>>
>>>
>>> This no longer git am -3 applies and based on previous patch reviews, I
>>> think perhaps this needs to be redone.
>>
>> I'll resend as soon as we come to agreement on series. Now I'm not
>> convinced to change much (except for using distinct flag to indicate error
>> as mentioned in 2nd patch discussion, then I don't need 3d patch).
>>
>> See comments below.
>>
>>>
>>> I don't believe moving/renaming virDomainObjWait is good/right, but I'm
>>> sure there would be other opinions on that.  Yes, QEMU is currently the
>>> only consumer... If it were to move, it should move to virdomainobjlist
>>> since that's where innards of virDomainObjPtr are managed. The fact that
>>> we look at ->parent.lock, well, <sigh>...
>>
>> I would not move the function without reason. It gets new name qemuDomainObjWait
>> becase it use qemu specific data, namely monError.
> 
> Today only qemu uses this generic virDomainObjWait which is generic
> without the need to have/use qemu specific things. Other domain
> consumers could use it if they chose.
> 
> I'm not in favor of moving, renaming, repurposing for a very specific
> issue for what is a generically used function. Maybe that's just me - so
> you could try to get a different reviewer if you want.

Sorry.

> 
> 
>>
>>>
>>> Anyway, you're attempting to special case something and perhaps you just
>>> need to create a qemuDomainObjWait that would call the timeout version
>>> of the virDomainObjWait and be able to handle whether some other error
>>> occurred.  Wouldn't the LastError before the current wait return NULL
>>> (as in no error), then during the timeout period if LastError is
>>> something couldn't that indicate the failure you're looking for.
>>
>> I introduced qemuDomainObjWait not to put virDomainObjWait and virDomainObjWaitUntil
>> in the first place but to check monError. Then rather then keeping too functions
>> it's look more simple to use only one and branch on given timeout.
>>
> 
> I would think if the goal was to have specific code for a specific
> purpose for specific functions, then introduce the qemuDomainObjWait,
> but have it call virDomainObjWait[Until] based on the need you have.
> Which in this case appears to fiddle with monError in some way.
> 
> Again - that's just my opinion
> 
> John
> 

Ok we can do that way. I just think that as virDomainObjWait/virDomainObjWaitUntil
will not be used anywhere in code after the patch then sooner or later somebody will
clean them up.

Nikolay

>>>
>>> I didn't spend a lot of time thinking about alternatives and how the
>>> code should change, but when you mention the patch has a drawback - that
>>> immediately raises concern.
>>
>> But ... I addressed this issue in next patch as I wrote.
>>
> 
> 
> [...]
> 

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list