Re: RGW-coroutine: what if a spawned stack fails?

On 03/26/2018 08:16 AM, Xinying Song wrote:
Hi, Cephers:

What if a spawned stack fails?

In RGWCoroutinesManager::run(), the loop that traverses the stacks does not
break when a spawned stack fails; the function just returns the last
stack's operate() value, or zero if any stack is blocked.

In the drain_all() macro, the failure info of a spawned stack is discarded.

Only the RGWCoroutine::collect() function checks a stack's return
value, but it cannot capture every stack's failure info.


For example, in RGWInitSyncStatusCoroutine::operate(), the drain_all()
macro is called after a collect() call. If a spawned stack
finishes after the collect() call and fails, that failure info is
lost.

Is this the intended behavior?
Thanks.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hi,

You're right that drain_all() discards errors from spawned coroutines. It's often used in error paths, where we already know which error code we're going to return, and just want to clean up everything else that was in progress.

For InitSyncStatus in particular, I don't think that drain_all() at the end is really necessary. In the two yield blocks above that, one is using spawn() with wait=true (which means that it yields until all spawned coroutines complete), and the second one is using call() (which also yields until the coroutine completes). So the only spawned coroutine that should be running after those yields is the lease_cr, which we stop before calling collect() - so it shouldn't be possible for any spawned stacks to finish after collect() there.

However, if you look a bit higher in the function, there's a call to 'drain_all_but_stack(lease_stack.get())'. The coroutines spawned in the "fetching remote log position" section call spawn() with wait=false, so any errors from RGWReadRemoteMDLogShardInfoCR -will- be ignored by drain_all_but_stack(). In this case, that's because we often see ENOENT errors (because the remote shard is actually empty), and just want to use the default/empty shard info for those.
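If we wanted to make that ENOENT intent explicit instead of dropping every error, the per-shard decision could look something like the sketch below. This is only an illustration: ShardInfo and accept_shard_result() are made-up names, not the real RGW types or anything in the tree, and I'm assuming the usual "0 on success, negative errno on failure" convention.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <string>

// Hypothetical stand-in for per-shard log position info; not the real RGW type.
struct ShardInfo {
  std::string marker;             // empty marker: shard has no entries yet
  std::uint64_t last_update = 0;  // placeholder timestamp field
};

// Decide what to do with the result of one shard fetch.
// ret is 0 on success or a negative errno on failure (assumed convention).
// Returns 0 if *out now holds usable info (fetched or default);
// otherwise returns the error and leaves *out untouched.
int accept_shard_result(int ret, const ShardInfo& fetched, ShardInfo* out) {
  assert(out != nullptr);
  if (ret == 0) {
    *out = fetched;      // normal case: use what the remote returned
    return 0;
  }
  if (ret == -ENOENT) {
    *out = ShardInfo{};  // remote shard is empty: fall back to default info
    return 0;
  }
  return ret;            // anything else is a real failure for the caller
}
```

With something like this, -ENOENT still maps to the default/empty shard info, but an error such as -EIO would be handed back to the caller rather than silently dropped by the drain.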

At a higher level, consider that you spawn 4 child coroutines that come back with 4 different error codes - how do you map those into a single error code that gets returned to the parent? The answer is likely to be different depending on the context, so there isn't a general solution that works everywhere.
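To make the "depends on the context" point concrete, here is a rough sketch of three ways a parent could fold its children's return codes into one. None of these helpers exist in RGW; the names are hypothetical, and treating -EAGAIN and -EBUSY as "retryable" is just an assumption for the example.

```cpp
#include <cerrno>
#include <vector>

// Policy 1: the first failure seen wins.
int first_error(const std::vector<int>& results) {
  for (int r : results) {
    if (r < 0) return r;
  }
  return 0;
}

// Policy 2: like policy 1, but one expected code (e.g. -ENOENT) is ignored.
int first_error_ignoring(const std::vector<int>& results, int tolerated) {
  for (int r : results) {
    if (r < 0 && r != tolerated) return r;
  }
  return 0;
}

// Policy 3: report a non-retryable error if there is one, otherwise the
// first retryable one. Which codes count as retryable is an assumption.
int prefer_non_retryable(const std::vector<int>& results) {
  int retryable = 0;
  for (int r : results) {
    if (r >= 0) continue;
    if (r == -EAGAIN || r == -EBUSY) {
      if (retryable == 0) retryable = r;  // remember it, keep looking
    } else {
      return r;                           // hard failure takes priority
    }
  }
  return retryable;
}
```

For child results {-ENOENT, -EIO, 0, -EBUSY}, first_error() returns -ENOENT, while first_error_ignoring(results, -ENOENT) and prefer_non_retryable() both return -EIO. Same children, different answers, which is why each call site has to pick its own policy.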

So I agree that there are places where we use these drain macros that could benefit from more specific error handling, and I'd welcome tests and pull requests to help improve on that.

Hope that helps,
Casey