every rgw stuck on "RGWReshardLock::lock found lock"

I've observed this occur on v14.2.22 and v15.2.12. I wasn't able to find anything obviously relevant in the changelogs, bug tickets, or existing mailing list threads.


In both cases, every RGW in the cluster starts spamming logs with lines that look like the following:


2022-09-04 14:20:45.231 7fc7b28c7700  0 INFO: RGWReshardLock::lock found lock on $BUCKET:c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.582171072.3067 to be held by another RGW process; skipping for now
2022-09-04 14:20:45.281 7fc7ca0f6700  0 block_while_resharding ERROR: bucket is still resharding, please retry
2022-09-04 14:20:45.283 7fc7ca0f6700  0 NOTICE: resharding operation on bucket index detected, blocking
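
(In case anyone else hits this and wants to see what the resharder is doing: as far as I know, the in-flight reshard can be inspected, and if necessary aborted, with the commands below. $BUCKET is just a placeholder for the affected bucket.)

# buckets currently queued for / undergoing resharding
radosgw-admin reshard list

# resharding status for a specific bucket
radosgw-admin reshard status --bucket=$BUCKET

# abort a stuck reshard for that bucket (listed for completeness; I didn't do this here)
radosgw-admin reshard cancel --bucket=$BUCKET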


The buckets in question were growing very quickly (hundreds of uploads per second, in the ballpark of 10 million objects when the bug hit), so it makes sense that they got picked up for resharding. What doesn't make sense is every RGW stopping all other processing (not responding over HTTP) and logging nothing but these locking messages. Something seems to be going pretty badly wrong if we're not just backing off the lock and retrying later. Only one RGW should be trying to reshard a given bucket at a time, right?
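
(For context on why they got picked up: as I understand it, dynamic resharding is governed by a couple of config options, mainly rgw_dynamic_resharding and rgw_max_objs_per_shard. Assuming the RGWs read from the central config database, they can be checked roughly like this; $BUCKET and the shard count are placeholders.)

# is automatic (dynamic) resharding enabled for the RGWs?
ceph config get client.rgw rgw_dynamic_resharding

# objects-per-shard threshold that triggers a reshard
ceph config get client.rgw rgw_max_objs_per_shard

# alternative for fast-growing buckets: reshard manually ahead of time
radosgw-admin bucket reshard --bucket=$BUCKET --num-shards=101   # example shard count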


The other weird part is that it cycled: a complete outage lasting ~7.5 minutes, followed by a couple of minutes of responding to a low volume of requests, and so on. Here you can see the outage in terms of HTTP status codes logged by our frontends for the second occurrence (aggregated across all RGWs in the cluster):

https://jhaas.us-east-1.linodeobjects.com/public/rgw-lock/http-codes.jpg


The same trend (though basically inverted) shows up if I graph the frequency of all log lines containing "starting new request", which should be logged any time an RGW begins servicing a new request:

https://jhaas.us-east-1.linodeobjects.com/public/rgw-lock/starting-new-request.jpg
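
(For anyone wanting to reproduce that graph, the count can be pulled straight from the RGW logs with something along these lines; the log path is just an example and will differ per deployment.)

# rough per-minute count of requests starting, across local RGW logs
grep -h 'starting new request' /var/log/ceph/ceph-client.rgw.*.log \
    | awk '{print substr($1 " " $2, 1, 16)}' | sort | uniq -c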


I don't have an explanation for that; 7.5 minutes is ~450 seconds, which doesn't correspond to any default timeout I'm aware of. After all of that was over, the buckets appear to have resharded successfully, and my current assumption is that the issue resolved itself once the resharding operation completed.
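
(For what it's worth, shard counts and per-shard fill can be checked after the fact with something like the following to confirm a reshard went through; the bucket name is a placeholder.)

# per-bucket shard count and objects-per-shard fill status
radosgw-admin bucket limit check

# stats for the specific bucket (newer releases report the shard count here too, I believe)
radosgw-admin bucket stats --bucket=$BUCKET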


I'll be trying to reliably reproduce this or observe it more closely in the wild, hopefully on v17, but I was hoping someone might have some insight in the meantime.


Thanks,

Josh


