Hi everybody,
we upgraded our containerized Red Hat Ceph Storage (Pacific) cluster to the
latest Quincy release (Community Edition).
The upgrade itself went fine; the cluster is HEALTH_OK and all daemons are
running the upgraded version:
---- %< ----
$ ceph -s
  cluster:
    id:     68675a58-cf09-4ebd-949c-b9fcc4f2264e
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum node02,node03,node04,node05,node01 (age 25h)
    mgr: node03.ztlair(active, since 25h), standbys: node01.koymku, node04.uvxgvp, node02.znqnhg, node05.iifmpc
    osd: 408 osds: 408 up (since 22h), 408 in (since 7d)
    rgw: 19 daemons active (19 hosts, 1 zones)

  data:
    pools:   11 pools, 8481 pgs
    objects: 236.99M objects, 544 TiB
    usage:   1.6 PiB used, 838 TiB / 2.4 PiB avail
    pgs:     8385 active+clean
             79   active+clean+scrubbing+deep
             17   active+clean+scrubbing

  io:
    client: 42 MiB/s rd, 439 MiB/s wr, 2.15k op/s rd, 1.64k op/s wr
---
$ ceph versions | jq .overall
{
  "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 437
}
---- >% ----
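In case a per-daemon view is useful: on a cephadm-managed cluster the running
version of every RGW daemon can also be listed via the orchestrator, roughly
like this:
---- %< ----
# list all RGW daemons with the image/version cephadm sees them running
$ ceph orch ps --daemon-type rgw
---- >% ----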
After all the daemons were upgraded, we started noticing that some RGW buckets
had become inaccessible.
s3cmd fails with NoSuchKey:
---- %< ----
$ s3cmd la -l
ERROR: S3 error: 404 (NoSuchKey)
---- >% ----
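A quick way to double-check the bucket metadata and index shard count on the
admin side (using the same anonymized placeholders "xy" as in the outputs
below) would be:
---- %< ----
# show the bucket instance, its marker/id and num_shards for one affected bucket
$ radosgw-admin bucket stats --tenant xy --uid xy --bucket xy
---- >% ----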
The buckets still exist according to "radosgw-admin bucket list".
Out of the ~600 buckets, 13 are inaccessible at the moment:
---- %< ----
$ radosgw-admin bucket radoslist --tenant xy --uid xy --bucket xy
2024-04-03T12:13:40.607+0200 7f0dbf4c4680 0 int
RGWRados::cls_bucket_list_ordered(const DoutPrefixProvider*,
RGWBucketInfo&, int, const rgw_obj_index_key&, const string&, const
string&, uint32_t, bool, uint16_t, RGWRados::ent_map_t&, bool*, bool*,
rgw_obj_index_key*, optional_yield, RGWBucketListNameFilter):
CLSRGWIssueBucketList for
xy:xy[6955f50e-5b23-4534-9b77-c7078f60f0d0.171713434.3]) failed
2024-04-03T12:13:40.609+0200 7f0dbf4c4680 0 int
RGWRados::cls_bucket_list_ordered(const DoutPrefixProvider*,
RGWBucketInfo&, int, const rgw_obj_index_key&, const string&, const
string&, uint32_t, bool, uint16_t, RGWRados::ent_map_t&, bool*, bool*,
rgw_obj_index_key*, optional_yield, RGWBucketListNameFilter):
CLSRGWIssueBucketList for
xy:xy[6955f50e-5b23-4534-9b77-c7078f60f0d0.171713434.3]) failed
---- >% ----
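Since CLSRGWIssueBucketList points at the bucket index, one way to narrow this
down further might be to inspect the index shard objects directly. A rough
sketch, assuming the default index pool name rgw.buckets.index (analogous to
our rgw.buckets.data) and the bucket id from the error above:
---- %< ----
# the per-shard index objects are named .dir.<bucket_id>.<shard>
$ rados -p rgw.buckets.index ls | grep 6955f50e-5b23-4534-9b77-c7078f60f0d0.171713434.3

# dump the omap keys of one shard; an error here would point at the index itself
$ rados -p rgw.buckets.index listomapkeys \
    .dir.6955f50e-5b23-4534-9b77-c7078f60f0d0.171713434.3.0

# let RGW verify the index consistency for the affected bucket
$ radosgw-admin bucket check --tenant xy --uid xy --bucket xy
---- >% ----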
The affected buckets are comparatively large, around 4 - 7 TB,
but not all buckets of that size are affected.
Using "rados -p rgw.buckets.data ls" it seems like all the objects are
still there,
although "rados -p rgw.buckets.data get objectname -" only prints
unusable (?) binary data,
even for objects of intact buckets.
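As far as I understand, that binary output is expected: RGW stripes larger S3
objects across multiple rados objects and keeps its metadata in xattrs, so the
raw payload of a single rados object is generally not the complete, usable S3
object. To check whether individual objects are still reachable once the
(broken) listing is bypassed, something like this could be tried (the object
key is just a hypothetical example):
---- %< ----
# stat one known object through RGW, bypassing the bucket listing
$ radosgw-admin object stat --tenant xy --uid xy --bucket xy --object some/known/key

# or fetch it directly by key via S3
$ s3cmd get s3://xy/some/known/key ./some-known-key
---- >% ----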
Overall we're facing around 60 TB of customer data that is just gone at the
moment.
Is there a way to recover from this situation, or to further narrow down the
root cause of the problem?
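If more detailed logs would help, we can re-run the failing listing with
elevated debug levels and share the output, e.g. along these lines:
---- %< ----
# reproduce the failing listing with verbose RGW and messenger logging
$ radosgw-admin --debug-rgw=20 --debug-ms=1 bucket radoslist \
    --tenant xy --uid xy --bucket xy 2> radoslist-debug.log
---- >% ----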
Kind regards,
Lorenz