This is similar to an old thread
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/FXVEWDU6NYCGEY5QB6IGGQXTUEZAQKNY/
but I don't see any responses there, so I'm opening this one.

PROBLEM DESCRIPTION

* The issue is seen on versioned buckets.
* With extended logging (debug level 5), we can see that LC reports deleting
  expired (non-current) versions of objects, but those versions are still
  present in the bucket: they appear in the bucket index and remain accessible
  to the user.

AN EXAMPLE OBJECT/VERSION

* Bucket: bald
* Object: adc/certs/injr-f5lb01b-vadc04.json
* Versions: it currently has 6 versions

$ aws s3api list-object-versions --bucket=bald --prefix=adc/certs/injr-f5lb01b-vadc04.json | jq -r '.Versions[] | [.Key, .VersionId, .LastModified] | @tsv'
adc/certs/injr-f5lb01b-vadc04.json    nBHrRDYZzuIrA0hORAIzh6QG8rzRF14    2024-06-28T21:13:00.014Z
adc/certs/injr-f5lb01b-vadc04.json    YGgH7VmZDq4M-j8qIrKq.4Valvvuoh4    2024-06-26T20:58:09.835Z
adc/certs/injr-f5lb01b-vadc04.json    qefseb1l.6WJyDNhH5buqX-qcZV2GAJ    2024-06-18T21:32:02.304Z
adc/certs/injr-f5lb01b-vadc04.json    s4YG598JEQC9A5jaJuI5S4XkCh1NRpN    2024-06-10T21:37:16.074Z
adc/certs/injr-f5lb01b-vadc04.json    z96LISOi8jBYHCrnbqHgNAAsnpAqbXm    2024-06-07T01:15:21.802Z
adc/certs/injr-f5lb01b-vadc04.json    Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp    2024-05-30T19:45:19.726Z

Looking at the oldest version, with ID Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp: LC
actually deletes it (or tries to delete it?) on several different occasions
-- LC runs daily at midnight UTC. For example:

2024-07-04T00:00:02.711+0000 7fd30f989700 2 lifecycle: DELETED::bald[1eeb7b2c-aaab-4dff-be19-be27acab9e85.352350675.1034]):adc/certs/injr-f5lb01b-vadc04.json[Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp] (non-current expiration) wp_thrd: 1, 0
...
2024-07-08T00:00:04.989+0000 7f199a7ac700 2 lifecycle: DELETED::bald[1eeb7b2c-aaab-4dff-be19-be27acab9e85.352350675.1034]):adc/certs/injr-f5lb01b-vadc04.json[Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp] (non-current expiration) wp_thrd: 2, 3
...
2024-07-09T00:00:02.671+0000 7f5a23cea700 2 lifecycle: DELETED::bald[1eeb7b2c-aaab-4dff-be19-be27acab9e85.352350675.1034]):adc/certs/injr-f5lb01b-vadc04.json[Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp] (non-current expiration) wp_thrd: 0, 4

However, as seen in the aws-cli output above, this version is still there.
Below is the output when we retrieve this exact version:

$ aws s3api get-object --bucket=bald --key=adc/certs/injr-f5lb01b-vadc04.json --version-id=Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp /tmp/outfile
{
    "AcceptRanges": "bytes",
    "Expiration": "expiry-date=\"Sat, 01 Jun 2024 00:00:00 GMT\", rule-id=\"delete-prior-versions\"",
    "LastModified": "Thu, 30 May 2024 19:45:19 GMT",
    "ContentLength": 1299,
    "ETag": "\"d9c9ff538f4e2f1435746d16cd9e62c8\"",
    "VersionId": "Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp",
    "ContentType": "binary/octet-stream",
    "Metadata": {}
}

and it also still appears in "radosgw-admin bucket list" output:

{
    "name": "adc/certs/injr-f5lb01b-vadc04.json",
    "instance": "Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp",
    "ver": {
        "pool": 15,
        "epoch": 563937
    },
    "locator": "",
    "exists": true,
    "meta": {
        "category": 1,
        "size": 1299,
        "mtime": "2024-05-30T19:45:19.726701Z",
        "etag": "d9c9ff538f4e2f1435746d16cd9e62c8",
        "storage_class": "",
        ...
        "content_type": "",
        "accounted_size": 1299,
        "user_data": "",
        "appendable": false
    },
    "tag": "1eeb7b2c-aaab-4dff-be19-be27acab9e85.1228454711.3959044872362449539",
    "flags": 1,
    "pending_map": [],
    "versioned_epoch": 85
},
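For completeness, the rule referenced by the "Expiration" header above
("delete-prior-versions") is a plain noncurrent-version expiration rule.
Paraphrased -- this is a sketch, not a copy-paste of the actual config --
it is along these lines:

$ aws s3api get-bucket-lifecycle-configuration --bucket=bald
{
    "Rules": [
        {
            "ID": "delete-prior-versions",
            "Filter": {
                "Prefix": ""
            },
            "Status": "Enabled",
            "NoncurrentVersionExpiration": {
                "NoncurrentDays": 1
            }
        }
    ]
}

(The Filter and the NoncurrentDays value above are placeholders, not the
real values from the bucket.)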
SOME MORE NOTES ON THIS BUCKET & OBJECT

* The current version of the object is not deleted: there is no delete marker.
* No object locking is configured for this bucket.
* I don't see any trace of this bucket or object in the gc list (the exact
  check is sketched at the end of this mail).
* The bucket has 101 shards and each shard has around 30K objects, so there is
  no noticeable skew in the distribution of objects across the bucket index.
  However, I see the ERROR lines below streaming while listing the bucket; not
  sure whether they are relevant to the LC issue.

...
2024-07-09T18:28:32.470+0000 7f8ef5db0740 0 ERROR: list_objects_ordered marker failed to make forward progress; attempt=4, prev_marker=lb_summary.lock[s2BIhE0HnXnj.yONcP6.T-dkwU-aWhn], cur_marker=lb_summary.lock[njFF6AkNKUoCVIsi-6pJVhDyaK8FycS]
2024-07-09T18:28:32.530+0000 7f8ef5db0740 0 ERROR: list_objects_ordered marker failed to make forward progress; attempt=2, prev_marker=lb_summary.lock[njFF6AkNKUoCVIsi-6pJVhDyaK8FycS], cur_marker=lb_summary.lock[GDUSDyTB4nGsjT0GfDYlnB5zrB8UnSV]
2024-07-09T18:28:32.546+0000 7f8ef5db0740 0 ERROR: list_objects_ordered marker failed to make forward progress; attempt=3, prev_marker=lb_summary.lock[U4cpWSv2b5bI8rm1vsDw.kcXmXrDYuV], cur_marker=lb_summary.lock[2NCtXN7KbO0ypy4CSmJCk1gGZEnhWfL]
...

* We ran "bucket check --fix" on the bucket a few days ago, but it resolved
  neither the LC issue nor the "failed to make forward progress" errors during
  bucket listing.
* Bucket stats for reference:

$ radosgw-admin bucket stats --bucket=bald | jq '. | .bucket, [.num_shards, .usage]'
"bald"
[
  101,
  {
    "rgw.main": {
      "size": 13065121441,
      "size_actual": 16259469312,
      "size_utilized": 13065121441,
      "size_kb": 12758908,
      "size_kb_actual": 15878388,
      "size_kb_utilized": 12758908,
      "num_objects": 1233984
    }
  }
]

QUESTIONS

* Is this by any chance a known issue? I searched the tracker but couldn't
  find a duplicate.
* Any ideas why the deletes initiated by LC might fail silently? I don't see
  any indication of the gc queue getting full around that time.
* Any ideas on debugging this further? Would log level 20 be helpful, and/or
  are there other log lines to look for? (My current plan is sketched below.)
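To be explicit about the gc check mentioned in the notes above, it was
roughly the following (a sketch; I'm matching on the bucket marker, since as
far as I know the tail/shadow object names referenced by gc entries are
prefixed with it):

$ radosgw-admin gc list --include-all | grep 1eeb7b2c-aaab-4dff-be19-be27acab9e85.352350675.1034
# --include-all also shows entries whose deferred time has not expired yet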
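Regarding the last question, what I'm planning to try next is roughly the
following (a sketch; it assumes the rgw daemons pick up settings from the mon
config store, that our release supports per-bucket LC processing, and the log
path depends on how the daemons are deployed): raise the rgw debug level,
kick LC for just this bucket, and grep the rgw log for this object/version.

$ ceph config set client.rgw debug_rgw 20     # revert to the previous level afterwards
$ radosgw-admin lc process --bucket=bald      # per-bucket form; may not exist on older releases
$ grep -E 'injr-f5lb01b-vadc04|Rfgi2NdGYWy' /var/log/ceph/ceph-client.rgw.*.log   # log path depends on deployment
$ ceph config set client.rgw debug_rgw 5

If there is a better-targeted set of debug subsystems or specific log lines
to watch for, pointers are welcome.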