Re: GCed (as in tail objects already deleted from the data pool) objects remain in the GC queue forever

Pritha Srivastava <prsrivas@xxxxxxxxxx> · Wed, 24 Nov 2021 16:41:15 +0530

Hi Jaka,

On Wed, Nov 24, 2021 at 4:11 PM Jaka Močnik <jaka@xxxxxxxxx> wrote:

> hi,
>
> running an octopus cluster (upgraded from nautilus a few months ago) of
> some 0.5PB capacity. it is used exclusively as an object storage via
> rgw (clients use the swift API), 6 rgw instances are used to cater to
> this. the cluster has been running for a bit over two years.
>
> it is subject to quite a heavy delete load (think in the order of
> magnitude of 1M deletes per day).
>
> until recently this was handled w/o any problems, however, some 10 days
> ago, our monitoring alerted us that the rgw gc queue was holding some
> 20k rgw objects dispersed over ~700k rados objects. while such peaks
> were common before, they were usually cleared very quickly. however,
> this situation has not cleared since. in fact every day, some 100-200k
> extra rados objects are added to the gc queue.
>
> after a bit of investigation it turned out that many of the objects in
> the gc queue were already garbage collected. i.e. rgw has deleted them
> from the rados rgw data pool, but has failed to remove them from the gc
> queue.
>

How did you diagnose this?

> the logs (debug_rgw = 20) do not show anything unusual. deletes
> succeed. even when deleting an already deleted rgw object (i.e. its
> rados tail objects), there are no complaints in the log (even though
> deletes of rados objects must fail as the objects are not present
> anymore). however, even after n-th deletion, the objects are not
> removed from the gc queue.
>
>
> so, can someone help with the following:
> - any pointers on where to start debugging this? I am at a loss since
> rgws seems happy enough according to the logs.
> - any ideas on how to remedy this situation? it will become a problem
> in a week or two, according to the trends.
>

Have you tried running the radosgw-admin gc list command? Are some entries
always there, past their expiration time? The --include-all flag lists all
entries, both expired and unexpired.
Also, do you see "RGWGC::process removing entries, marker: " in the logs?
Are the same markers repeated?
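
A minimal sketch of how the marker check could be scripted, in case the
logs are too large to scan by eye. It assumes that the marker is the first
whitespace-delimited token after "marker: " on the log line; that layout,
and the function name, are assumptions made for illustration.

```python
import re
from collections import Counter

# Matches the log text quoted above. Treating the next non-space token as
# the marker is an assumption about the log line layout.
MARKER_RE = re.compile(r"RGWGC::process removing entries, marker:\s*(\S+)")


def repeated_markers(lines):
    """Return {marker: count} for every marker logged more than once."""
    counts = Counter()
    for line in lines:
        match = MARKER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return {marker: n for marker, n in counts.items() if n > 1}
```

Passing an open log file to repeated_markers() would return the markers
that show up more than once, which is what one would expect to see if gc
keeps processing the same entries without removing them.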

> with regard to remedy in case we cannot diagnose the cause and fix it
> soon enough, I was thinking about:
> - stopping deletes to rgws for a short while,
> - dumping the gc queue contents,
> - stopping rgws,
> - clearing or recreating the rgw gc queue structures on rados pools,
> - restarting rgws and deletes,
> - manually deleting the rados objects in the old gc queue dump.
>
> is that a sound plan?
>
> if so, what exactly does the "clearing or recreating the rgw gc queue
> structures on rados pools" entail?
>
> I am under the impression that the gc queue is stored in gc.<number>
> objects in the GC namespace in the default.rgw.log pool.
>
> would just deleting these and starting rgw do the trick? or do I need
> to somehow recreate empty objects in their stead?
>

Have you tried running radosgw-admin gc process to clear the expired
entries, or running it with --include-all to clear all entries?
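
To see whether gc process actually removes anything, one option is to save
the output of radosgw-admin gc list --include-all before and after the
run and compare the two dumps. The sketch below assumes the output is a
JSON array whose entries each carry a "tag" field; that field name and the
function name are assumptions, so adjust them to what your version prints.

```python
import json


def persistent_tags(before_json, after_json):
    """Return the sorted tags present in both gc list dumps.

    Both arguments are the raw JSON text of a gc list dump. The "tag" key
    is an assumption about the output format.
    """
    before = {entry["tag"] for entry in json.loads(before_json)}
    after = {entry["tag"] for entry in json.loads(after_json)}
    return sorted(before & after)
```

Entries queued by new deletes appear only in the second dump and are
ignored. Any tag that survives a gc process --include-all run is a
candidate for a stuck entry.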

> best regards,
>   Jaka
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>