Hi Jaka,

On Wed, Nov 24, 2021 at 4:11 PM Jaka Močnik <jaka@xxxxxxxxx> wrote:

> hi,
>
> running an octopus cluster (upgraded from nautilus a few months ago) of
> some 0.5PB capacity. it is used exclusively as an object storage via
> rgw (clients use the swift API), 6 rgw instances are used to cater to
> this. the cluster has been running for a bit over two years.
>
> it is subject to quite a heavy delete load (think in the order of
> magnitude of 1M deletes per day).
>
> until recently this was handled w/o any problems, however, some 10 days
> ago, our monitoring alerted us that the rgw gc queue was holding some
> 20k rgw objects dispersed over ~700k rados objects. while such peaks
> were common before, they were usually cleared very quickly. however,
> this situation has not cleared since. in fact every day, some 100-200k
> extra rados objects are added to the gc queue.
>
> after a bit of investigation it turned out that many of the objects in
> the gc queue were already garbage collected. i.e. rgw has deleted them
> from the rados rgw data pool, but has failed to remove them from the gc
> queue.
>

How did you diagnose this?

> the logs (debug_rgw = 20) do not show anything unusual. deletes
> succeed. even when deleting an already deleted rgw object (i.e. its
> rados tail objects), there are no complaints in the log (even though
> deletes of rados objects must fail as the objects are not present
> anymore). however, even after n-th deletion, the objects are not
> removed from the gc queue.
>
> so, can someone help with the following:
> - any pointers on where to start debugging this? I am at a loss since
>   rgws seems happy enough according to the logs.
> - any ideas on how to remedy this situation? it will become a problem
>   in a week or two, according to the trends.
>

Have you tried running radosgw-admin gc list command? Are some entries always there, past their expiration time?
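One way to check whether the same entries linger is to save the output of radosgw-admin gc list at two different times and compare the two dumps. Below is a minimal Python sketch of that comparison. It assumes the command prints a JSON array in which every entry carries a "tag" field identifying the deleted rgw object; please verify that against the output on your version. The function name persistent_tags is only illustrative.

```python
import json


def persistent_tags(earlier_dump, later_dump):
    """Return the set of gc entry tags that appear in both dumps.

    Both arguments are the raw JSON text saved from two separate runs of
    `radosgw-admin gc list`. Assumption: the output is a JSON array of
    objects, each with a "tag" key.
    """
    earlier = {entry["tag"] for entry in json.loads(earlier_dump)}
    later = {entry["tag"] for entry in json.loads(later_dump)}
    # Tags present in both snapshots were not removed from the queue
    # between the two runs.
    return earlier & later
```

If the two dumps are taken a few gc cycles apart and the returned set stays large, those entries are probably the ones that are stuck.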
There is a flag --include-all which can also be used to list all expired and unexpired entries.
Also in the logs - do you see this "RGWGC::process removing entries, marker: "? Are the markers getting repeated?

> with regard to remedy in case we cannot diagnose the cause and fix it
> soon enough, I was thinking about:
> - stopping deletes to rgws for a short while,
> - dumping the gc queue contents,
> - stopping rgws,
> - clearing or recreating the rgw gc queue structures on rados pools,
> - restarting rgws and deletes,
> - manually deleting the rados objects in the old gc queue dump.
>
> is that a sound plan?
>
> if so, what exactly does the "clearing or recreating the rgw gc queue
> structures on rados pools" entail?
>
> I am under the impression that the gc queue is stored in gc.<number>
> objects in the GC namespace in the default.rgw.log pool.
>
> would just deleting these and starting rgw do the trick? or do I need
> to somehow recreate empty objects in their stead?
>

Have you tried using the command: radosgw-admin gc process, to clear the expired entries and with --include-all to clear all entries?

> best regards,
> Jaka
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx