[RGW][Lifecycle][Versioned Buckets][Reef] Although LC deletes non-current versions, they remain in the bucket

This is similar to an old thread https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/FXVEWDU6NYCGEY5QB6IGGQXTUEZAQKNY/ but I don't see any responses there, so I'm opening this one.

PROBLEM DESCRIPTION

* The issue is seen on versioned buckets.
* Using extended logging (debug level 5), we can see that LC deletes expired (non-current) versions of objects, but those versions are still present in the bucket: they are listed in the bucket index and remain accessible to the user. (A sketch of the LC rule in play follows this list.)
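
I haven't pasted the full lifecycle configuration here, but judging from the rule-id "delete-prior-versions" that shows up in the Expiration header further down, the rule is presumably something along these lines (the NoncurrentDays value below is only a placeholder, not our actual setting):

$ aws s3api get-bucket-lifecycle-configuration --bucket=bald
{
    "Rules": [
        {
            "ID": "delete-prior-versions",
            "Filter": { "Prefix": "" },
            "Status": "Enabled",
            "NoncurrentVersionExpiration": { "NoncurrentDays": 1 }
        }
    ]
}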

AN EXAMPLE OBJECT/VERSION

* Bucket: bald
* Object: adc/certs/injr-f5lb01b-vadc04.json
* Versions: It currently has 6 versions

$ aws s3api list-object-versions --bucket=bald --prefix=adc/certs/injr-f5lb01b-vadc04.json | jq -r '.Versions[] | [.Key, .VersionId, .LastModified] | @tsv'
adc/certs/injr-f5lb01b-vadc04.json      nBHrRDYZzuIrA0hORAIzh6QG8rzRF14  2024-06-28T21:13:00.014Z
adc/certs/injr-f5lb01b-vadc04.json      YGgH7VmZDq4M-j8qIrKq.4Valvvuoh4   2024-06-26T20:58:09.835Z
adc/certs/injr-f5lb01b-vadc04.json      qefseb1l.6WJyDNhH5buqX-qcZV2GAJ  2024-06-18T21:32:02.304Z
adc/certs/injr-f5lb01b-vadc04.json      s4YG598JEQC9A5jaJuI5S4XkCh1NRpN 2024-06-10T21:37:16.074Z
adc/certs/injr-f5lb01b-vadc04.json      z96LISOi8jBYHCrnbqHgNAAsnpAqbXm 2024-06-07T01:15:21.802Z
adc/certs/injr-f5lb01b-vadc04.json      Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp 2024-05-30T19:45:19.726Z

Looking at the oldest version, with ID Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp: LC actually deletes it (or tries to delete it?) on several different occasions -- LC runs daily at midnight UTC. For example,

...
2024-07-04T00:00:02.711+0000 7fd30f989700  2 lifecycle: DELETED::bald[1eeb7b2c-aaab-4dff-be19-be27acab9e85.352350675.1034]):adc/certs/injr-f5lb01b-vadc04.json[Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp] (non-current expiration) wp_thrd: 1, 0
...
2024-07-08T00:00:04.989+0000 7f199a7ac700  2 lifecycle: DELETED::bald[1eeb7b2c-aaab-4dff-be19-be27acab9e85.352350675.1034]):adc/certs/injr-f5lb01b-vadc04.json[Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp] (non-current expiration) wp_thrd: 2, 3
...
2024-07-09T00:00:02.671+0000 7f5a23cea700  2 lifecycle: DELETED::bald[1eeb7b2c-aaab-4dff-be19-be27acab9e85.352350675.1034]):adc/certs/injr-f5lb01b-vadc04.json[Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp] (non-current expiration) wp_thrd: 0, 4

However, as seen in the aws-cli output above, this version is still there. Below is the output when we retrieve this exact version:

$ aws s3api get-object --bucket=bald --key=adc/certs/injr-f5lb01b-vadc04.json --version-id=Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp /tmp/outfile
{
    "AcceptRanges": "bytes",
    "Expiration": "expiry-date=\"Sat, 01 Jun 2024 00:00:00 GMT\", rule-id=\"delete-prior-versions\"",
    "LastModified": "Thu, 30 May 2024 19:45:19 GMT",
    "ContentLength": 1299,
    "ETag": "\"d9c9ff538f4e2f1435746d16cd9e62c8\"",
    "VersionId": "Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp",
    "ContentType": "binary/octet-stream",
    "Metadata": {}
}

and this is its entry in the radosgw-admin bucket list output:

{
    "name": "adc/certs/injr-f5lb01b-vadc04.json",
    "instance": "Rfgi2NdGYWy.g7H6JevgqsDLXahSHJp",
    "ver": {
        "pool": 15,
        "epoch": 563937
    },
    "locator": "",
    "exists": true,
    "meta": {
        "category": 1,
        "size": 1299,
        "mtime": "2024-05-30T19:45:19.726701Z",
        "etag": "d9c9ff538f4e2f1435746d16cd9e62c8",
        "storage_class": "",
        ...
        "content_type": "",
        "accounted_size": 1299,
        "user_data": "",
        "appendable": false
    },
    "tag": "1eeb7b2c-aaab-4dff-be19-be27acab9e85.1228454711.3959044872362449539",
    "flags": 1,
    "pending_map": [],
    "versioned_epoch": 85
},
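
In case it's useful for comparison, the raw index records for this key can also be dumped directly; as far as I understand, this should show the plain, instance, and OLH entries for the object (command sketch, not yet run with this exact filter):

$ radosgw-admin bi list --bucket=bald --object=adc/certs/injr-f5lb01b-vadc04.json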

SOME MORE NOTES ON THIS BUCKET & OBJECT

* The current object is not deleted: no delete-marker.
* No object locking is configured for this bucket.
* I don't see any trace of this bucket or object in the gc list.
* The bucket has 101 shards and each shard holds around 30K objects, so there's no noticeable skew in the distribution of objects across the bucket index. However, I see the ERROR lines below streaming when listing the bucket; not sure whether they're relevant to the LC issue.
...

2024-07-09T18:28:32.470+0000 7f8ef5db0740  0 ERROR: list_objects_ordered marker failed to make forward progress; attempt=4, prev_marker=lb_summary.lock[s2BIhE0HnXnj.yONcP6.T-dkwU-aWhn], cur_marker=lb_summary.lock[njFF6AkNKUoCVIsi-6pJVhDyaK8FycS]

2024-07-09T18:28:32.530+0000 7f8ef5db0740  0 ERROR: list_objects_ordered marker failed to make forward progress; attempt=2, prev_marker=lb_summary.lock[njFF6AkNKUoCVIsi-6pJVhDyaK8FycS], cur_marker=lb_summary.lock[GDUSDyTB4nGsjT0GfDYlnB5zrB8UnSV]

2024-07-09T18:28:32.546+0000 7f8ef5db0740  0 ERROR: list_objects_ordered marker failed to make forward progress; attempt=3, prev_marker=lb_summary.lock[U4cpWSv2b5bI8rm1vsDw.kcXmXrDYuV], cur_marker=lb_summary.lock[2NCtXN7KbO0ypy4CSmJCk1gGZEnhWfL]
...

* We ran "bucket check --fix" on the bucket a few days ago, but it didn't resolve the LC issue or the "failed to make forward progress" error stream during bucket listing (rough command sketches follow the bucket stats below).
* Bucket stats for reference:
$ radosgw-admin bucket stats --bucket=bald | jq '. | .bucket, [.num_shards, .usage]'
"bald"
[
101,
{
"rgw.main": {
"size": 13065121441,
"size_actual": 16259469312,
"size_utilized": 13065121441,
"size_kb": 12758908,
"size_kb_actual": 15878388,
"size_kb_utilized": 12758908,
"num_objects": 1233984
}
}
]
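
For completeness, the gc and bucket-check steps mentioned above were along these lines. I believe newer Reef point releases also have "bucket check olh" / "bucket check unlinked" variants for versioned-bucket index entries, but I haven't verified they exist on our exact version, so treat those two as tentative:

$ radosgw-admin gc list --include-all
$ radosgw-admin bucket check --fix --bucket=bald
$ radosgw-admin bucket check olh --bucket=bald        # if available on this release
$ radosgw-admin bucket check unlinked --bucket=bald   # if available on this release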

QUESTIONS

* Is this by any chance a known issue? I searched the tracker but couldn't find a duplicate.
* Any ideas why the deletes initiated by LC might fail silently? I don't see any indication of the gc queue getting full around that time.
* Any ideas on debugging this issue further? Would log level 20 be helpful, and/or are there other log lines to look for? (A rough sketch of what I'm planning to try is below.)
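
To frame that last question: my current idea is to re-run LC for just this bucket with verbose logging and watch what happens to that specific version, roughly like below. I believe "lc process" accepts --bucket on Reef, but please correct me if that's wrong:

$ radosgw-admin lc list
$ radosgw-admin --debug-rgw=20 lc process --bucket=bald
$ aws s3api list-object-versions --bucket=bald --prefix=adc/certs/injr-f5lb01b-vadc04.json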


