On 03/26/2018 08:16 AM, Xinying Song wrote:
Hi, Cephers:

What happens if a spawned stack fails?

In RGWCoroutinesManager::run(), the loop that traverses the stacks does
not break when a spawned stack fails; the function just returns the last
stack's operate() value, or zero if any stack is blocked.

In the drain_all() macro, the failure info of a spawned stack is
discarded. Only RGWCoroutine::collect() checks a stack's return value,
but it cannot capture the failure info of every stack.
For example, in RGWInitSyncStatusCoroutine::operate(), the drain_all()
macro is invoked after a call to collect(). If a spawned stack finishes
after that collect() call and fails, its failure info is lost.

Is this the intended behavior?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi,

You're right that drain_all() discards errors from spawned coroutines.
It's often used in error paths, where we already know which error code
we're going to return, and just want to clean up everything else that
was in progress.

For InitSyncStatus in particular, I don't think that drain_all() at the
end is really necessary. In the two yield blocks above that, one is
using spawn() with wait=true (which means that it yields until all
spawned coroutines complete), and the second one is using call() (which
also yields until the coroutine completes). So the only spawned
coroutine that should be running after those yields is the lease_cr,
which we stop before calling collect() - so it shouldn't be possible for
any spawned stacks to finish after collect() there.

However, if you look a bit higher in the function, there's a call to
'drain_all_but_stack(lease_stack.get())'. The coroutines spawned in the
"fetching remote log position" section call spawn() with wait=false, so
any errors from RGWReadRemoteMDLogShardInfoCR -will- be ignored by
drain_all_but_stack(). In this case, that's because we often see ENOENT
errors (because the remote shard is actually empty), and just want to
use the default/empty shard info for those.
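
As an illustration of what more selective handling could look like here,
the following is a minimal sketch and not the actual RGW code: it treats
-ENOENT from a shard read as "empty shard, use defaults" but still reports
any other error to the caller. ShardInfo, ChildResult and
collect_shard_info() are hypothetical names invented for this example;
drain_all_but_stack() itself does not distinguish error codes.

```cpp
#include <cerrno>
#include <vector>

// Hypothetical stand-in for the per-shard info read from the remote log.
struct ShardInfo {
  bool populated;  // false means "default/empty shard info"
};

// Hypothetical record of one finished child: an errno-style return code
// (0 on success, negative on failure) plus whatever info it read.
struct ChildResult {
  int ret;
  ShardInfo info;
};

// Fill `out` with one entry per child. A child that failed with -ENOENT
// contributes default (empty) info and is not counted as an error.
// Any other failure also contributes default info, but the first such
// error code is returned so the caller can decide what to do with it.
int collect_shard_info(const std::vector<ChildResult>& children,
                       std::vector<ShardInfo>& out) {
  out.clear();
  int first_error = 0;
  for (const ChildResult& child : children) {
    if (child.ret < 0) {
      out.push_back(ShardInfo{});  // value-initialized: populated == false
      if (child.ret != -ENOENT && first_error == 0) {
        first_error = child.ret;
      }
    } else {
      out.push_back(child.info);
    }
  }
  return first_error;
}
```

With children returning 0, -ENOENT, -EIO and -EINVAL, this sketch would
hand back -EIO while still producing four shard entries, three of them
defaults.
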
At a higher level, consider that you spawn 4 child coroutines that come
back with 4 different error codes - how do you map those into a single
error code that gets returned to the parent? The answer is likely to be
different depending on the context, so there isn't a general solution
that works everywhere.
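
To make the "depends on the context" point concrete, here is a small
sketch of two possible reduction policies over the same set of child
return codes. Both functions, first_error() and error_by_priority(), are
hypothetical helpers written for this example and are not part of the RGW
coroutine code; they only assume errno-style codes (0 for success,
negative for failure).

```cpp
#include <cerrno>
#include <vector>

// Policy 1: report the first failure, in spawn order.
int first_error(const std::vector<int>& rets) {
  for (int r : rets) {
    if (r < 0) {
      return r;
    }
  }
  return 0;
}

// Policy 2: report the failure the caller ranks highest. `priority` lists
// error codes from most to least important; if none of them occurred,
// fall back to policy 1.
int error_by_priority(const std::vector<int>& rets,
                      const std::vector<int>& priority) {
  for (int wanted : priority) {
    for (int r : rets) {
      if (r < 0 && r == wanted) {
        return r;
      }
    }
  }
  return first_error(rets);
}
```

For children returning -ENOENT, -EBUSY, -EIO and -EINVAL, the first policy
yields -ENOENT, while the second with a priority list of {-EIO, -EINVAL}
yields -EIO. Neither answer is right in general, which is why the choice
has to be made by each caller.
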
So I agree that there are places where we use these drain macros that
could benefit from more specific error handling, and I'd welcome tests
and pull requests to help improve on that.

Hope that helps,
Casey