Re: RGW-coroutine: what if a spawned stack fails?

Casey Bodley <cbodley@xxxxxxxxxx> · Mon, 26 Mar 2018 16:00:48 -0400

On 03/26/2018 08:16 AM, Xinying Song wrote:
Hi, Cephers:

What if a spawned stack fails?

In RGWCoroutinesManager::run(), the loop that traverses the stacks does
not break when a spawned stack fails; the function just returns the
last stack's operate() value, or zero when any stack is blocked.

In the drain_all() macro, the failure info of a spawned stack is discarded.

Only the RGWCoroutine::collect() function cares about a stack's return
value, but it cannot capture every stack's failure info.

For example, in RGWInitSyncStatusCoroutine::operate(), the drain_all()
macro is invoked after a collect() call. If a spawned stack finishes
after that collect() call and then fails, its failure info is lost.

Is this an anticipated behavior?
Thanks.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hi,

You're right that drain_all() discards errors from spawned coroutines. 
It's often used in error paths, where we already know which error code 
we're going to return, and just want to clean up everything else that 
was in progress.
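
To make the scenario from the question concrete, here is a minimal toy
model in plain C++. It is not the real RGW code: ChildStack,
collect_finished(), drain_discarding() and drain_reporting() are
hypothetical names made up for this sketch, and "waiting" for a child is
modeled as simply dropping its entry. It only shows how a failure that
arrives after the collect step disappears in a discarding drain, and what
a reporting variant might look like.

```cpp
#include <vector>

// Hypothetical stand-in for a spawned stack (not the real RGWCoroutinesStack):
// either still running, or finished with 0 / a negative errno-style code.
struct ChildStack {
  bool finished;
  int ret;
};

// Toy analogue of collect(): reaps only the children that have already
// finished and returns the first failure among them (0 if none failed).
int collect_finished(std::vector<ChildStack>& children) {
  int first_error = 0;
  std::vector<ChildStack> still_running;
  for (const ChildStack& c : children) {
    if (!c.finished) {
      still_running.push_back(c);
    } else if (c.ret < 0 && first_error == 0) {
      first_error = c.ret;
    }
  }
  children = still_running;
  return first_error;
}

// Toy analogue of drain_all(): waits for whatever is left (modeled here as
// dropping the entries) and never looks at the return codes.
void drain_discarding(std::vector<ChildStack>& children) {
  children.clear();
}

// Hypothetical variant that reports the first failure while draining.
int drain_reporting(std::vector<ChildStack>& children) {
  int first_error = 0;
  for (const ChildStack& c : children) {
    if (c.ret < 0 && first_error == 0) {
      first_error = c.ret;
    }
  }
  children.clear();
  return first_error;
}
```

In an error path the discarding form is usually what you want, since the
parent already holds the code it intends to return. The reporting form
only matters when the drain is the last chance to see a child's result.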

For InitSyncStatus in particular, I don't think that drain_all() at the 
end is really necessary. In the two yield blocks above that, one is 
using spawn() with wait=true (which means that it yields until all 
spawned coroutines complete), and the second one is using call() (which 
also yields until the coroutine completes). So the only spawned 
coroutine that should be running after those yields is the lease_cr, 
which we stop before calling collect() - so it shouldn't be possible for 
any spawned stacks to finish after collect() there.

However, if you look a bit higher in the function, there's a call to 
'drain_all_but_stack(lease_stack.get())'. The coroutines spawned in the 
"fetching remote log position" section call spawn() with wait=false, so 
any errors from RGWReadRemoteMDLogShardInfoCR -will- be ignored by 
drain_all_but_stack(). In this case, that's because we often see ENOENT 
errors (because the remote shard is actually empty), and just want to 
use the default/empty shard info for those.
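
As a sketch of what more specific handling could look like for that
section (this is not what the current code does, which ignores every
error there): tolerate only ENOENT and fall back to the default shard
info, but keep any other error. ShardInfo and accept_missing_shard() are
hypothetical names for this example, and it assumes the usual convention
of negative errno values as return codes.

```cpp
#include <cerrno>
#include <string>

// Hypothetical stand-in for the per-shard info; an empty marker plays the
// role of the default/empty shard info mentioned above.
struct ShardInfo {
  std::string marker;
};

// Treats -ENOENT as "remote shard is empty": resets info to its default and
// reports success. Any other return code is passed through unchanged, and
// info is left untouched.
int accept_missing_shard(int ret, ShardInfo& info) {
  if (ret == -ENOENT) {
    info = ShardInfo{};
    return 0;
  }
  return ret;
}
```

With this shape, an -EIO from one shard would still reach the parent,
while an empty remote shard would not be treated as a failure.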

At a higher level, consider that you spawn 4 child coroutines that come 
back with 4 different error codes - how do you map those into a single 
error code that gets returned to the parent? The answer is likely to be 
different depending on the context, so there isn't a general solution 
that works everywhere.
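
To illustrate why the mapping depends on context, here are two possible
policies written as plain functions over the children's return codes.
Both are hypothetical helpers for this example, not existing RGW
functions, and both assume negative errno values for failures.

```cpp
#include <cerrno>
#include <vector>

// Policy A: report the first failure, in spawn order.
int first_error(const std::vector<int>& rets) {
  for (int r : rets) {
    if (r < 0) {
      return r;
    }
  }
  return 0;
}

// Policy B: report the first failure that is not the tolerated code,
// e.g. when -ENOENT just means "nothing there yet".
int first_error_ignoring(const std::vector<int>& rets, int tolerated) {
  for (int r : rets) {
    if (r < 0 && r != tolerated) {
      return r;
    }
  }
  return 0;
}
```

For four children returning -ENOENT, -EIO, -EBUSY and -EINVAL, policy A
gives the parent -ENOENT, while policy B with -ENOENT tolerated gives it
-EIO. Neither answer is right in general; the caller has to choose.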

So I agree that there are places where we use these drain macros that 
could benefit from more specific error handling, and I'd welcome tests 
and pull requests to help improve on that.

Hope that helps,
Casey