Any thoughts on this? We just experienced this again last night. Our three
RGW servers had trouble servicing requests for approximately 7 minutes while
this reshard happened. Our users received 5xx errors from haproxy, which
fronts the RGW instances. Haproxy is configured with a backend server timeout
of 60 seconds and logged a couple thousand connections with termination code
'sH--', indicating the RGWs did not return response headers within that time.
This is especially concerning because it affects many buckets, not just the
one currently being resharded.

I am testing Nautilus on our dev cluster; are there any known fixes for this
issue included there?

Regards,
Josh

On Thu, Oct 31, 2019 at 2:43 PM Josh Haft <paccrap@xxxxxxxxx> wrote:
>
> Hi,
>
> Currently running Mimic 13.2.5.
>
> We had reports this morning of timeouts and failures with PUT and GET
> requests to our Ceph RGW cluster. I found these messages in the RGW log:
>
>   RGWReshardLock::lock failed to acquire lock on bucket_name:bucket_instance ret=-16
>   NOTICE: resharding operation on bucket index detected, blocking
>   block_while_resharding ERROR: bucket is still resharding, please retry
>
> They were preceded by many of these, which I think are normal/expected:
>
>   check_bucket_shards: resharding needed: stats.num_objects=6415879 shard max_objects=6400000
>
> Our RGW cluster sits behind haproxy, which notified me approximately 90
> seconds after the first 'resharding needed' message that no backends were
> available. It appears this dynamic reshard process caused the RGWs to lock
> up for a period of time. Roughly 2 minutes later the reshard error messages
> stopped and operation returned to normal.
>
> Looking back through previous RGW logs, I see a similar event from about a
> week ago on the same bucket. We have several buckets with shard counts
> exceeding 1k (this one has only 128) and much larger object counts, so
> clearly this isn't the first time dynamic resharding has been invoked on
> this cluster.
>
> Has anyone seen this? I expect it will come up again, and I can turn up
> debugging if that will help. Thanks for any assistance!
>
> Josh
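
For reference, a minimal sketch of the haproxy settings in play here (the
backend name, server addresses, and ports are placeholders, not our actual
config):

    defaults
        mode http
        option httplog          # access log lines carry the termination flags, e.g. 'sH--'
        timeout connect 5s
        timeout client  60s
        timeout server  60s     # the 60-second backend server timeout mentioned above

    backend rgw
        balance roundrobin
        server rgw1 192.0.2.11:7480 check
        server rgw2 192.0.2.12:7480 check
        server rgw3 192.0.2.13:7480 check

In haproxy's termination codes, 's' means the server-side timeout expired and
'H' means the session was still waiting for the server's response headers,
which is consistent with the RGWs stalling while the bucket index was being
resharded.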
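
For anyone digging into the 'resharding needed' messages: a sketch of the
ceph.conf options that govern dynamic resharding, plus the commands for
inspecting the reshard queue. The values shown are the defaults as I
understand them on Mimic, and the section name is a placeholder, so treat
this as an illustration rather than a recommendation:

    [client.rgw.<instance_name>]
    # dynamic resharding on/off (enabled by default since Luminous)
    rgw_dynamic_resharding = true
    # objects-per-shard target that feeds the check_bucket_shards
    # 'resharding needed' check
    rgw_max_objs_per_shard = 100000
    # how often the reshard thread scans for buckets needing resharding
    rgw_reshard_thread_interval = 600
    # how long the bucket can be blocked while its index is resharded
    # (120s by default on Mimic, if I recall correctly)
    rgw_reshard_bucket_lock_duration = 120

    # inspect the reshard queue and the status of a given bucket
    radosgw-admin reshard list
    radosgw-admin reshard status --bucket=<bucket_name>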