Re: sync errors are not cleared

Yuval Lifshitz <ylifshit@xxxxxxxxxx> · Thu, 3 Feb 2022 10:01:59 +0200

+ceph-devel 

On Wed, Feb 2, 2022 at 10:56 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
On Wed, Feb 2, 2022 at 8:36 AM Yuval Lifshitz <ylifshit@xxxxxxxxxx> wrote:

> i do see sync errors with "ERR_BUSY_RESHARDING": https://0x0.st/oH3D.json
> after dynamic reshard happened mid-sync, even though sync was finished successfully.
>
> is this expected?

those errors are possible, but i wouldn't say expected. 

but shouldn't the errors get cleared after the objects were successfully synced?

if fetch_remote_obj() is returning this error, that seems to imply that
RGWRados::guard_reshard() retried the index operation
NUM_RESHARD_RETRIES=10 times and still found it locked for resharding.
and after each try, guard_reshard() calls
RGWRados::block_while_resharding(), which has its own retry loop with
num_retries=10 that polls the reshard status then sleeps 5 seconds
with reshard_wait->wait()

if my understanding is correct, that would mean that the successful
reshard took over ~500 seconds to complete? or something under
guard_reshard() isn't working right
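
The ~500 second figure comes from multiplying the two retry counts by the sleep interval (10 x 10 x 5). As a rough illustration only, the nested loops described above could be modeled like the sketch below. This is not the RGW code: Outcome, simulate_guard_reshard(), NUM_BLOCK_RETRIES and RESHARD_WAIT_SECONDS are hypothetical names, the clock is simulated, and the model assumes the reshard simply stays in progress for a fixed number of seconds.

```cpp
// Hypothetical model of the nested retry structure described in this
// thread. The constants mirror the quoted values; the real logic lives in
// RGWRados::guard_reshard() and RGWRados::block_while_resharding().
constexpr int NUM_RESHARD_RETRIES = 10;  // outer loop in guard_reshard()
constexpr int NUM_BLOCK_RETRIES = 10;    // inner loop ("num_retries"), name assumed
constexpr int RESHARD_WAIT_SECONDS = 5;  // stand-in for reshard_wait->wait()

struct Outcome {
  bool busy_resharding;  // true if every retry was exhausted
  int waited_seconds;    // total simulated sleep time
};

// Simulates an index operation against a bucket whose reshard stays in
// progress for reshard_duration_seconds (an assumption of this model).
constexpr Outcome simulate_guard_reshard(int reshard_duration_seconds) {
  int now = 0;  // simulated clock, in seconds
  for (int attempt = 0; attempt < NUM_RESHARD_RETRIES; ++attempt) {
    if (now >= reshard_duration_seconds) {
      return Outcome{false, now};  // index operation goes through
    }
    for (int poll = 0; poll < NUM_BLOCK_RETRIES; ++poll) {
      if (now >= reshard_duration_seconds) {
        break;  // reshard finished, go back and retry the operation
      }
      now += RESHARD_WAIT_SECONDS;  // poll found it busy, so sleep
    }
  }
  return Outcome{true, now};  // analogous to returning ERR_BUSY_RESHARDING
}
```

Under this model, simulate_guard_reshard(10) succeeds after 10 seconds of waiting, while simulate_guard_reshard(600) gives up after 500 seconds. If the actual reshard finishes in about 10 seconds, exhausting the retries would point at something other than a slow reshard.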

it looks like there is a problem. when I look at the client that uploads the objects to the primary, it gets stalled for about 10 seconds while the reshard is happening. however, the secondary sync process is stalled for a much longer period, until it successfully syncs.
