This is similar to an old thread
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/FXVEWDU6NYCGEY5QB6IGGQXTUEZAQKNY/
but I don't see any responses there, so I'm opening this one.

PROBLEM DESCRIPTION

* The issue is seen on versioned buckets.
* With extended logging (debug level 5), we can see that LC deletes expired (non-current) versions of objects, but those versions are still to be found in the bucket: they are both listed in the bucket index and accessible to the user.

AN EXAMPLE OBJECT/VERSION

* Bucket: bald
* Object: adc/certs/injr-f5lb01b-vadc04.json
* Versions: it currently has 6 versions:

$ aws s3api list-object-versions --bucket=bald --prefix=adc/certs/injr-f5lb01b-vadc04.json | jq -r '.Versions[] | [.Key, .VersionId, .LastModified] | @tsv'
adc/certs/injr-f5lb01b-vadc04.json    nBHrRDYZzuIrA0hORAIzh6QG8rzRF14    2024-06-28T21:13:00.014Z
adc/certs/injr-f5lb01b-vadc04.json    YGgH7VmZDq4M-j8qIrKq.4Valvvuoh4    2024-06-26T20:58:09.835Z
adc/certs/injr-f5lb01b-vadc04.json    qefseb1l.6WJyDNhH5buqX-qcZV2GAJ    2024-06-18T21:32:02.304Z
adc/certs/injr-f5lb01b-vadc04.json    s4YG598JEQC9A5jaJuI5S4XkCh1NRpN    2024-06-10T21:37:16.074Z
adc/certs/injr-f5lb01b-vadc04.json    z96LISOi8jBYHCrnbqHgNAAsnpAqbXm    2024-06-07T01:15:21.802Z
adc/certs/injr-f5lb01b-vadc04.json    Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp    2024-05-30T19:45:19.726Z

Looking at the oldest version, with ID Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp: LC actually deletes (or tries to delete?) it on several different occasions -- LC runs daily at midnight UTC. For example:

. . .
2024-07-04T00:00:02.711+0000 7fd30f989700  2 lifecycle: DELETED::bald[1eeb7b2c-aaab-4dff-be19-be27acab9e85.352350675.1034]):adc/certs/injr-f5lb01b-vadc04.json[Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp] (non-current expiration) wp_thrd: 1, 0
. .
2024-07-08T00:00:04.989+0000 7f199a7ac700  2 lifecycle: DELETED::bald[1eeb7b2c-aaab-4dff-be19-be27acab9e85.352350675.1034]):adc/certs/injr-f5lb01b-vadc04.json[Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp] (non-current expiration) wp_thrd: 2, 3
. .
2024-07-09T00:00:02.671+0000 7f5a23cea700  2 lifecycle: DELETED::bald[1eeb7b2c-aaab-4dff-be19-be27acab9e85.352350675.1034]):adc/certs/injr-f5lb01b-vadc04.json[Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp] (non-current expiration) wp_thrd: 0, 4

However, as seen in the aws-cli output above, this version is still there. Below is the output when we retrieve this exact version:

$ aws s3api get-object --bucket=bald --key=adc/certs/injr-f5lb01b-vadc04.json --version-id=Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp /tmp/outfile
{
    "AcceptRanges": "bytes",
    "Expiration": "expiry-date=\"Sat, 01 Jun 2024 00:00:00 GMT\", rule-id=\"delete-prior-versions\"",
    "LastModified": "Thu, 30 May 2024 19:45:19 GMT",
    "ContentLength": 1299,
    "ETag": "\"d9c9ff538f4e2f1435746d16cd9e62c8\"",
    "VersionId": "Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp",
    "ContentType": "binary/octet-stream",
    "Metadata": {}
}

...and in radosgw-admin bucket list:

    {
        "name": "adc/certs/injr-f5lb01b-vadc04.json",
        "instance": "Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp",
        "ver": {
            "pool": 15,
            "epoch": 563937
        },
        "locator": "",
        "exists": true,
        "meta": {
            "category": 1,
            "size": 1299,
            "mtime": "2024-05-30T19:45:19.726701Z",
            "etag": "d9c9ff538f4e2f1435746d16cd9e62c8",
            "storage_class": "",
            ...
            "content_type": "",
            "accounted_size": 1299,
            "user_data": "",
            "appendable": false
        },
        "tag": "1eeb7b2c-aaab-4dff-be19-be27acab9e85.1228454711.3959044872362449539",
        "flags": 1,
        "pending_map": [],
        "versioned_epoch": 85
    },
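For completeness: I haven't pasted the bucket's lifecycle configuration here, but going by the rule-id in the Expiration header and the "(non-current expiration)" tag in the LC log lines, "delete-prior-versions" appears to be an ordinary non-current version expiration rule. It can be dumped with the command below; the output shown is only my guess at its rough shape (the NoncurrentDays value in particular is illustrative, not the real one):

$ aws s3api get-bucket-lifecycle-configuration --bucket=bald
{
    "Rules": [
        {
            "ID": "delete-prior-versions",
            "Status": "Enabled",
            "Filter": {
                "Prefix": ""
            },
            "NoncurrentVersionExpiration": {
                "NoncurrentDays": 1
            }
        }
    ]
}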
SOME MORE NOTES ON THIS BUCKET & OBJECT

* The current object is not deleted: there is no delete marker.
* No object locking is configured for this bucket.
* I don't see any trace of this bucket or object in the gc list.
* The bucket has 101 shards and each shard holds around ~30K objects, so there is no noticeable skew in the distribution of objects across the bucket index. However, I do see the ERROR lines below streaming when listing the bucket; I'm not sure whether they are relevant to the LC issue.

...
2024-07-09T18:28:32.470+0000 7f8ef5db0740  0 ERROR: list_objects_ordered marker failed to make forward progress; attempt=4, prev_marker=lb_summary.lock[s2BIhE0HnXnj.yONcP6.T-dkwU-aWhn], cur_marker=lb_summary.lock[njFF6AkNKUoCVIsi-6pJVhDyaK8FycS]
2024-07-09T18:28:32.530+0000 7f8ef5db0740  0 ERROR: list_objects_ordered marker failed to make forward progress; attempt=2, prev_marker=lb_summary.lock[njFF6AkNKUoCVIsi-6pJVhDyaK8FycS], cur_marker=lb_summary.lock[GDUSDyTB4nGsjT0GfDYlnB5zrB8UnSV]
2024-07-09T18:28:32.546+0000 7f8ef5db0740  0 ERROR: list_objects_ordered marker failed to make forward progress; attempt=3, prev_marker=lb_summary.lock[U4cpWSv2b5bI8rm1vsDw.kcXmXrDYuV], cur_marker=lb_summary.lock[2NCtXN7KbO0ypy4CSmJCk1gGZEnhWfL]
...

* We ran "bucket check --fix" on the bucket a few days ago, but it resolved neither the LC issue nor the "failed to make forward progress" error stream during bucket listing.
* Bucket stats for reference:

$ radosgw-admin bucket stats --bucket=bald | jq '. | .bucket, [.num_shards, .usage]'
"bald"
[
  101,
  {
    "rgw.main": {
      "size": 13065121441,
      "size_actual": 16259469312,
      "size_utilized": 13065121441,
      "size_kb": 12758908,
      "size_kb_actual": 15878388,
      "size_kb_utilized": 12758908,
      "num_objects": 1233984
    }
  }
]

QUESTIONS

* Is this by any chance a known issue? I searched the tracker but couldn't find a duplicate.
* Any ideas why the deletes initiated by LC might fail silently? I don't see any indication of the gc queue getting full around that time.
* Any ideas on debugging this further? Would log level 20 be helpful, and/or are there other log lines to look for?
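P.S. In case anyone wants to check for the same symptom on their own cluster, below is the rough one-liner I use to count non-current versions that, judging by the expiry-date header shown above, should already have been expired. The 2-day cutoff is only my assumption about the rule's window (adjust it to the actual rule), and on a bucket with ~1.2M objects this takes a while since aws-cli pages through all versions:

$ aws s3api list-object-versions --bucket=bald --output=json \
    | jq --arg cutoff "$(date -u -d '2 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
        '[.Versions[] | select(.IsLatest == false and .LastModified < $cutoff)] | length'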