Re: [PATCH 2/3] qemu: special error code in case of no job on cancel block job

On 09/27/2016 05:06 AM, Nikolay Shirokovskiy wrote:
> 
> 
> On 26.09.2016 23:07, John Ferlan wrote:
>>
>>
>> On 09/07/2016 05:24 AM, Nikolay Shirokovskiy wrote:
>>> A special error code helps gracefully handle race conditions when
>>> cancelling a blockjob. Consider, for example, a pull backup. We want
>>> to stop it, and in the process we cancel the blockjob for some of the
>>> exported disks. But by the time this command reaches qemu the blockjob
>>> may have failed for some reason, and the result of the cancel (and of
>>> the stop) will be an error with a rather unexpected message - no such
>>> job (sort of). Instead we can detect this race, treat it as a correct
>>> cancellation, and wait until the failure event arrives. Then we can
>>> report a correct error message, probably with some details from qemu.
>>>
>>> Users of domainBlockJobAbort can be affected, as -2 is a new possible
>>> return code. Not sure if I should stick to the original behaviour in
>>> this case.
>>> ---
>>>  src/qemu/qemu_monitor.c      |  5 +++++
>>>  src/qemu/qemu_monitor_json.c | 11 +++++++----
>>>  2 files changed, 12 insertions(+), 4 deletions(-)
>>>
>>
>> I haven't looked ahead to patch 3/3 yet, but perhaps it'd be better in
>> the "caller" that cares to:
>>
>>     if (qemu*() < 0) {
>>         virErrorPtr err = virGetLastError();
>>         if (err && err->code == VIR_ERR_OPERATION_INVALID) {
>>             /* Do something specific */
>>         }
>>     }
>>
>> Returning -2 alters virDomainBlockJobAbort's expected return values,
>> and that's probably not a good idea since we don't know whether some
>> caller has written (if virDomainBlockJobAbort() == -1) instead of <= -1.
>>
> 
> It is commonplace in libvirt to have special error codes instead of only
> 0/-1 (qemuMonitorJSONCheckCPUx86, for example), and it feels bulky to have
> checks that involve virGetLastError. It looks like the other places that
> use a similar construction have no other choice because of an RPC call or
> something similar.

True - my concern was more that virDomainBlockJobAbort now has a new
possible error value of -2 that's not described.  Still, just adding to
the documentation that it can now return -1 or -2 probably isn't the
best idea, especially if what you're trying to do is resolve/handle an
issue for a specific case within the qemu code.

Other APIs within the bowels of qemu (monitor, monitor_json, etc.) that
return -2 usually have that handled by the caller, which, before
returning out into what would be API land, returns what the API has
documented ({0, -1}, {true, false}, {NULL, address}, {count, 0, -1}, etc.).
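
To make that concrete, here is a rough sketch of the pattern I have in
mind. The helper name, the stubbed type, and the exact -2 semantics are
assumptions taken from this patch, so treat it as illustrative rather
than real driver code:

  /* Sketch only: stand-ins for libvirt's internal monitor layer.
   * Assumed contract for the helper: 0 on success, -1 on failure,
   * -2 if qemu says the block job no longer exists. */
  typedef struct _qemuMonitor qemuMonitor;
  int qemuMonitorBlockJobCancel(qemuMonitor *mon, const char *device);

  static int
  qemuDomainBlockJobAbortExample(qemuMonitor *mon, const char *device)
  {
      int rc = qemuMonitorBlockJobCancel(mon, device);

      if (rc == -2) {
          /* The job already vanished (e.g. it failed on its own).
           * Treat this as a successful cancel and wait for the block
           * job event to carry the details from qemu. */
          rc = 0;
      }

      /* Only the documented values {0, -1} escape toward the public API. */
      return rc < 0 ? -1 : 0;
  }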

Changing an external API return value is a risky proposition - it works
for some instances and fails for others... I've seen code that does this:

  if (public_api(arg1, arg2) == -1) {
      /* print some error message */
      return -1;
  }

even though it's been strongly suggested to do something like

   if (public_api(arg1, arg2) < 0) {
       /* get the specific error (usually via errno) and perhaps
        * make a decision based on that... but fall through to:
        * print some error message */
       return -1;
   }
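
For completeness, a small self-contained sketch of that second, more
tolerant pattern against the public API; treating VIR_ERR_OPERATION_INVALID
as "job already gone" is only an assumption about how this race would
surface to an application:

  #include <stdio.h>
  #include <libvirt/libvirt.h>
  #include <libvirt/virterror.h>

  /* Abort a block job but tolerate the "job already gone" race. */
  static int
  abort_block_job(virDomainPtr dom, const char *disk)
  {
      if (virDomainBlockJobAbort(dom, disk, 0) < 0) {
          virErrorPtr err = virGetLastError();

          /* Hypothetical: if the job disappeared underneath us, treat
           * the abort as already done instead of failing everything. */
          if (err && err->code == VIR_ERR_OPERATION_INVALID)
              return 0;

          fprintf(stderr, "failed to abort block job on %s: %s\n", disk,
                  err && err->message ? err->message : "unknown error");
          return -1;
      }

      return 0;
  }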

> 
> I think that when a function raises an error it does not take much care
> over the error code value, as the codes are not really semantic (at least
> for a common error such as 'operation invalid'), so one day the caller
> could catch 'operation invalid' for a different reason from a different
> place in the stack.
> 
> As to virDomainBlockJobAbort, it is easy to patch.
> 
> I think I should also stop reporting an error from
> qemuMonitorJSONBlockJobError in the DeviceNotActive case; however, it is
> used from several places, even for things like setting the speed. That is
> a legacy of commit b976165c, where a lot of qemu commands share common
> error-checking code. I think I'd better copy this code into
> qemuMonitorJSONBlockJobCancel and change it there instead of in the
> common place. What do you think?
> 

I think you need to think about all the callers of
qemuMonitorJSONBlockJobError...  I almost hate to use the word fragile
to describe things, but there is a lot of complexity involved with
respect to all the consumers. I just recall hearing about someone doing
a lot of muttering while implementing the code. The snapshot code in
particular (which you seem to have copied to a degree) has been the
source of some fairly "exotic" bug reports.

I personally don't have a lot of insight into the totality of the
blockdev* code. I'd need to spend some time to really understand things
better and provide a more clear/concise answer...


John

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list


