Hello!

I've run into a bit of an issue with one of our radosgw production clusters. The setup is two radosgw nodes behind HAProxy load balancing, which in turn are connected to the Ceph cluster. Everything is running 14.2.2, so Nautilus. It's tied to an OpenStack cluster, so Keystone is the authentication backend (that shouldn't really matter here, though).

Today both rgw backends crashed. Checking the logs, it seems to be related to dynamic resharding of a bucket, causing lock errors. Log snippet: https://pastebin.com/uBCnhinF

Following http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021368.html (old, but it seemed relevant), I performed a manual reshard of the affected bucket, which succeeded:

  radosgw-admin bucket reshard --bucket="XXX/YYY" --num-shards=256

Checking the bucket's metadata, it now correctly shows 256 shards, up from 128.

HOWEVER, dynamic resharding still kept happening and bringing down the backends. I suspect it's because the old reshard op is still hanging around; it shows up in `reshard list`: https://pastebin.com/dPChwBCT

Since the resharding seems to have been successful when run manually, I now want to remove that stale reshard op, but I can't: I get this error when trying: https://pastebin.com/071kfAsa

For now I've had to resort to setting rgw_dynamic_resharding = false in ceph.conf to stop the problem from recurring (the exact commands and config are sketched in the P.S. below).

Ideas?

Cheers,
Erik
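
P.S. For anyone searching the archives later, here's roughly the sequence I went through. Bucket name anonymized as XXX/YYY, and <bucket_id> is a placeholder; the metadata and cancel invocations are written from memory and the docs, so treat this as a sketch rather than an exact transcript:

  # manual reshard of the affected bucket
  radosgw-admin bucket reshard --bucket="XXX/YYY" --num-shards=256

  # verify the new shard count in the bucket instance metadata:
  # first get the entry point to find the bucket_id, then the instance
  radosgw-admin metadata get bucket:XXX/YYY
  radosgw-admin metadata get bucket.instance:XXX/YYY:<bucket_id>   # num_shards now 256

  # the stale op that I suspect keeps retriggering resharding
  radosgw-admin reshard list

  # what I tried to remove it, and what fails with the pastebin error above
  radosgw-admin reshard cancel --bucket="XXX/YYY"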
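And the workaround currently in place on both rgw nodes, followed by a restart of the radosgw instances (I believe the same option can also be set at runtime via the Nautilus config store, but I went with ceph.conf; the section name of course depends on how your rgw instances are named):

  # ceph.conf on the rgw nodes
  [client.rgw.<instance>]
      rgw_dynamic_resharding = false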