Hi,

We are running an Octopus cluster (upgraded from Nautilus a few months ago) with roughly 0.5 PB of capacity. It is used exclusively as object storage via rgw (clients use the Swift API), served by 6 rgw instances. The cluster has been running for a bit over two years.

The cluster is subject to quite a heavy delete load (on the order of 1M deletes per day). Until recently this was handled without any problems. However, some 10 days ago our monitoring alerted us that the rgw gc queue was holding some 20k rgw objects spread over ~700k rados objects. Such peaks were common before, but they usually cleared very quickly. This one has not cleared since; in fact, some 100-200k extra rados objects are added to the gc queue every day.

After a bit of investigation it turned out that many of the objects in the gc queue had already been garbage collected, i.e. rgw has deleted them from the rados rgw data pool but has failed to remove them from the gc queue.

The logs (debug_rgw = 20) do not show anything unusual. Deletes succeed. Even when rgw deletes an already deleted rgw object (i.e. its rados tail objects), there are no complaints in the log (even though the deletes of the rados objects must fail, as the objects are no longer present). However, even after the n-th deletion, the objects are not removed from the gc queue.

So, can someone help with the following:

- Any pointers on where to start debugging this? I am at a loss, since the rgws seem happy enough according to the logs.
- Any ideas on how to remedy this situation? According to the trends, it will become a problem in a week or two.

As for a remedy in case we cannot diagnose and fix the cause soon enough, I was thinking about:

- stopping deletes to the rgws for a short while,
- dumping the gc queue contents,
- stopping the rgws,
- clearing or recreating the rgw gc queue structures on rados pools,
- restarting the rgws and the deletes,
- manually deleting the rados objects listed in the old gc queue dump.

Is that a sound plan? If so, what exactly does "clearing or recreating the rgw gc queue structures on rados pools" entail? I am under the impression that the gc queue is stored in gc.<number> objects in the GC namespace of the default.rgw.log pool. Would just deleting these and starting the rgws do the trick, or do I need to somehow recreate empty objects in their stead?

Best regards,
Jaka
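
P.S. For anyone who wants to reproduce the "already collected but still queued" check, here is a minimal Python sketch. It is not the exact procedure used above. It assumes that "radosgw-admin gc list --include-all" prints a JSON array whose entries carry a "tag" and an "objs" list with "pool" and "oid" fields, and that "rados -p <pool> stat <oid>" exits non-zero when the object is missing. The helper names and the sample data are made up for illustration.

```python
import json
import subprocess
from collections import Counter

# Hypothetical sample, shaped like the assumed output of
# `radosgw-admin gc list --include-all` (object names are invented).
SAMPLE = """
[
  {"tag": "tag-a", "time": "2021-01-01 00:00:00.000000Z",
   "objs": [
     {"pool": "default.rgw.buckets.data", "oid": "marker__shadow_obj_1",
      "key": "", "instance": ""},
     {"pool": "default.rgw.buckets.data", "oid": "marker__shadow_obj_2",
      "key": "", "instance": ""}
   ]},
  {"tag": "tag-b", "time": "2021-01-01 00:05:00.000000Z",
   "objs": [
     {"pool": "default.rgw.buckets.data", "oid": "marker__shadow_other_1",
      "key": "", "instance": ""}
   ]}
]
"""


def fetch_gc_dump():
    """Return the raw gc queue dump as text (needs a working radosgw-admin)."""
    result = subprocess.run(
        ["radosgw-admin", "gc", "list", "--include-all"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def load_gc_entries(raw_json):
    """Parse the JSON text of a gc queue dump into a list of entries."""
    return json.loads(raw_json)


def tail_objects(entries):
    """Yield (pool, oid) for every rados object referenced by the gc entries."""
    for entry in entries:
        for obj in entry.get("objs", []):
            yield obj["pool"], obj["oid"]


def summarize(entries):
    """Count gc entries and the rados objects they reference, per pool."""
    per_pool = Counter(pool for pool, _ in tail_objects(entries))
    return {
        "gc_entries": len(entries),
        "rados_objects": sum(per_pool.values()),
        "per_pool": dict(per_pool),
    }


def rados_object_exists(pool, oid):
    """Assumption: `rados stat` returns a non-zero exit code for a missing object."""
    result = subprocess.run(
        ["rados", "-p", pool, "stat", oid],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def split_by_presence(entries, exists=rados_object_exists):
    """Split queued rados objects into those still in the pool and those gone."""
    present, missing = [], []
    for pool, oid in tail_objects(entries):
        (present if exists(pool, oid) else missing).append((pool, oid))
    return present, missing
```

On a live cluster one would call load_gc_entries(fetch_gc_dump()) and then split_by_presence() on the result; a large "missing" list would match the symptom described above. With ~700k queued rados objects, one "rados stat" process per object is slow, so sampling a subset of the entries is probably the more practical choice.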