Re: ceph multisite lifecycle not working

hi Chris,

https://docs.ceph.com/en/latest/radosgw/dynamicresharding/#lifecycle-fixes
may be relevant here. there's a `radosgw-admin lc reshard fix` command
that you can run on the secondary site to add buckets back to the lc
list. if you omit the --bucket argument, it should scan all buckets
and re-link everything with a lifecycle policy
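
as a rough sketch (using bucket1 from your example below; adjust the bucket
name for your setup), run on the secondary zone:

  # re-link one bucket's lifecycle policy into the lc list
  radosgw-admin lc reshard fix --bucket bucket1

  # or, without --bucket, scan all buckets and re-link any with a policy
  radosgw-admin lc reshard fix

  # then confirm the bucket shows up again
  radosgw-admin lc list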

On Fri, Dec 6, 2024 at 5:04 PM Christopher Durham <caduceus42@xxxxxxx> wrote:
>
>
> I have 18.2.4 on Rocky 9 Linux. This system has been updated from octopus -> pacific -> quincy (18.2.2) -> (el8 -> el9 reinstall of each server, with the ceph osds and mons surviving) -> reef (18.2.4) over several years.
>
> It appears that I have two probably related problems with lifecycle expiration in a multisite configuration.
> I have two zones, one on each side of a multisite setup. I recently discovered (about a month after the el9 and reef 18.2.4 updates) that lifecycle expiration was (mostly) not working on the secondary zone side. I had initially thought there might be replication issues, and while there are replication issues on individual buckets that required me to full sync them, the majority of the issues are because lifecycle expiration is not working on the secondary side.
> The observation that caused me to think lifecycle is the issue is that, based on the lifecycle policy for a given bucket, all objects in that bucket should already be deleted. What we are seeing is that all objects have been deleted from the bucket on the master zone, but NONE of them have been deleted on the slave side. This may vary based on the date the objects were created across multiple lifecycle runs on the master side, but objects never get deleted/expired on the slave side.
> I tracked this down to one of two causes, let's say for a given bucket bucket1
>
> 1. radosgw-admin lc list on the master shows that the bucket completes its lifecycle processing periodically. But on the slave side, it shows:
> "started": "Thu, 01 Jan 1970 ...""status": "UNINITIAL"
> If I run:
> radosgw-admin lc process --bucket bucket1
> that particular bucket flushes all of its expired objects (takes a while). But as far as I can tell at this point, it never runs lifecycle again on the slave side.
>
> Now, let's say I have bucket2.
> 2. radosgw-admin lc list on the slave side does NOT show the bucket in the json output, yet the same command on the master side shows it!
>
> Given this, running
> radosgw-admin lc process --bucket bucket2
> causes C++ exceptions and the command crashes on the slave side (which makes sense, actually).
>
> Yet in this case if I do:
> aws --profile bucket2_owner s3api get-bucket-lifecycle-configuration --bucket bucket2
> it shows the lifecycle configuration for the bucket, regardless of whether I point the awscli at the master or slave zone.
> In this case, if I redeploy the lifecycle with put-bucket-lifecycle-configuration to the master side, then the lifecycle status shows up in
> radosgw-admin lc list
> on the slave side (as well as on the master) as UNINITIAL, and this issue devolves to #1 above.
> Note that lifecycle expiration on the slave side does work for some number of buckets, but most remain in the UNINITIAL state, and others are not there at all until I redeploy the lifecycle. The slave side is a lot more active in reading and writing.
>
> So, why would the bucket not show up in lc list on the slave side, where it had before (I can't say how long ago 'before' was)? How can I get it to automatically perform lifecycle on the slave side? Would this perhaps be related to
>
> rgw_lc_max_worker
> rgw_lc_max_wp_worker
> rgw_lifecycle_work_time
> It appears that lifecycle processing is independent on each side, meaning that a lifecycle processing of bucket A on one side runs separately from lifecycle processing of bucket A on the other side, and as such an object may exist on one side for a time when it has been already deleted on the other side.
>
> How does rgw_lifecycle_work_time work? Does it mean that outside of the work_time window no new lifecycle processing starts, or that runs already in progress abort/stop?
> Either way this may explain my observations as to too many buckets staying in UNINITIAL when those that are processing have a lot of data to delete.
> And why is this last one rgw_lifecycle_work_time and not rgw_lc_work_time?
> Anyway, any help on these issues would be appreciated. Thanks
> -Chris
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



