Re: sync errors are not cleared

Casey Bodley <cbodley@xxxxxxxxxx> · Thu, 3 Feb 2022 09:47:35 -0500

On Thu, Feb 3, 2022 at 3:02 AM Yuval Lifshitz <ylifshit@xxxxxxxxxx> wrote:
>
> +ceph-devel
>
> On Wed, Feb 2, 2022 at 10:56 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>>
>> On Wed, Feb 2, 2022 at 8:36 AM Yuval Lifshitz <ylifshit@xxxxxxxxxx> wrote:
>> >
>> > i do see sync errors with "ERR_BUSY_RESHARDING": https://0x0.st/oH3D.json
>> > after dynamic reshard happened mid-sync, even though sync was finished successfully.
>> >
>> > is this expected?
>>
>> those errors are possible, but i wouldn't say expected.
>
>
> but shouldn't the errors get cleared after the objects were successfully synced?

maybe, but the only thing that currently clears entries from
'radosgw-admin sync error list' is 'radosgw-admin sync error trim'

>
>>
>> if fetch_remote_obj() is returning this error, that seems to imply that
>> RGWRados::guard_reshard() retried the index operation
>> NUM_RESHARD_RETRIES=10 times and still found it locked for resharding.
>> and after each try, guard_reshard() calls
>> RGWRados::block_while_resharding(), which has its own retry loop with
>> num_retries=10 that polls the reshard status then sleeps 5 seconds
>> with reshard_wait->wait()
>>
>> if my understanding is correct, that would mean that the successful
>> reshard took ~500 seconds or more (10 x 10 x 5s) to complete? or
>> something under guard_reshard() isn't working right
>>
> it looks like there is a problem. the client that uploads the objects to
> the primary stalls for about 10 seconds while the reshard is happening.
> however, the secondary sync process stalls for a much longer period,
> until it successfully syncs
>

ok. 10 seconds still seems rather long for a reshard, unless it's a big workload
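
for reference, here is a rough model of where the ~500 second figure quoted
above comes from. this is only a sketch of the nested retry structure as i
described it, not the actual RGWRados code: simulate_guard_reshard(),
GuardResult and the still_resharding callback are made-up names, and the
sleeps are counted rather than performed. only the three constants come from
the description above.

```cpp
#include <cassert>
#include <functional>

// Values quoted in the thread. Everything else here is a simplified,
// hypothetical model and not the real RGWRados implementation.
constexpr int NUM_RESHARD_RETRIES = 10;  // outer retries in guard_reshard()
constexpr int BLOCK_NUM_RETRIES = 10;    // polls in block_while_resharding()
constexpr int RESHARD_WAIT_SECONDS = 5;  // sleep per poll (reshard_wait->wait())

struct GuardResult {
  bool succeeded;      // false models returning ERR_BUSY_RESHARDING
  int waited_seconds;  // total simulated time spent sleeping
};

// still_resharding(elapsed_seconds) reports whether the bucket index is
// still locked for resharding after the given amount of simulated waiting.
GuardResult simulate_guard_reshard(
    const std::function<bool(int)>& still_resharding) {
  int waited = 0;
  for (int attempt = 0; attempt < NUM_RESHARD_RETRIES; ++attempt) {
    if (!still_resharding(waited)) {
      return {true, waited};  // the index operation goes through
    }
    // Model of block_while_resharding(): poll the status, then sleep.
    for (int poll = 0; poll < BLOCK_NUM_RETRIES; ++poll) {
      if (!still_resharding(waited)) {
        break;  // reshard finished, go back and retry the index operation
      }
      waited += RESHARD_WAIT_SECONDS;
    }
  }
  return {false, waited};  // every retry found the index locked
}
```

in this model a reshard that finishes after 10 seconds lets the operation
succeed after about 10 seconds of waiting, while the error is only returned
once the full 10 x 10 x 5 = 500 seconds have been used up. that is why an
ERR_BUSY_RESHARDING from a ~10 second reshard looks suspicious.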
