Hi everybody,
we upgraded our containerized Red Hat Ceph Storage (Pacific) cluster to the
latest Quincy release (Community Edition).
The upgrade itself went fine; the cluster is HEALTH_OK and all daemons are
running the upgraded version:
---- %< ----
$ ceph -s
  cluster:
    id:     68675a58-cf09-4ebd-949c-b9fcc4f2264e
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum node02,node03,node04,node05,node01 (age 25h)
    mgr: node03.ztlair(active, since 25h), standbys: node01.koymku, node04.uvxgvp, node02.znqnhg, node05.iifmpc
    osd: 408 osds: 408 up (since 22h), 408 in (since 7d)
    rgw: 19 daemons active (19 hosts, 1 zones)

  data:
    pools:   11 pools, 8481 pgs
    objects: 236.99M objects, 544 TiB
    usage:   1.6 PiB used, 838 TiB / 2.4 PiB avail
    pgs:     8385 active+clean
             79   active+clean+scrubbing+deep
             17   active+clean+scrubbing

  io:
    client: 42 MiB/s rd, 439 MiB/s wr, 2.15k op/s rd, 1.64k op/s wr
---
$ ceph versions | jq .overall
{
  "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 437
}
---- >% ----
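In case a per-daemon view is useful: on a cephadm-managed cluster the running
version of every RGW daemon can also be listed via the orchestrator, roughly
like this:
---- %< ----
# list all RGW daemons with the image/version cephadm sees them running
$ ceph orch ps --daemon-type rgw
---- >% ----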
After all the daemons were upgraded, we started noticing that some RGW buckets
had become inaccessible.
s3cmd fails with NoSuchKey:
---- %< ----
$ s3cmd la -l
ERROR: S3 error: 404 (NoSuchKey)
---- >% ----
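A quick way to double-check the bucket metadata and index shard count on the
admin side (using the same anonymized placeholders "xy" as in the outputs
below) would be:
---- %< ----
# show the bucket instance, its marker/id and num_shards for one affected bucket
$ radosgw-admin bucket stats --tenant xy --uid xy --bucket xy
---- >% ----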
The buckets still exist according to "radosgw-admin bucket list".
Out of the ~600 buckets, 13 are inaccessible at the moment:
---- %< ----
$ radosgw-admin bucket radoslist --tenant xy --uid xy --bucket xy
2024-04-03T12:13:40.607+0200 7f0dbf4c4680 0 int
RGWRados::cls_bucket_list_ordered(const DoutPrefixProvider*,
RGWBucketInfo&, int, const rgw_obj_index_key&, const string&, const
string&, uint32_t, bool, uint16_t, RGWRados::ent_map_t&, bool*, bool*,
rgw_obj_index_key*, optional_yield, RGWBucketListNameFilter):
CLSRGWIssueBucketList for
xy:xy[6955f50e-5b23-4534-9b77-c7078f60f0d0.171713434.3]) failed
2024-04-03T12:13:40.609+0200 7f0dbf4c4680 0 int
RGWRados::cls_bucket_list_ordered(const DoutPrefixProvider*,
RGWBucketInfo&, int, const rgw_obj_index_key&, const string&, const
string&, uint32_t, bool, uint16_t, RGWRados::ent_map_t&, bool*, bool*,
rgw_obj_index_key*, optional_yield, RGWBucketListNameFilter):
CLSRGWIssueBucketList for
xy:xy[6955f50e-5b23-4534-9b77-c7078f60f0d0.171713434.3]) failed
---- >% ----
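Since CLSRGWIssueBucketList points at the bucket index, one way to narrow this
down further might be to inspect the index shard objects directly. A rough
sketch, assuming the default index pool name rgw.buckets.index (analogous to
our rgw.buckets.data) and the bucket id from the error above:
---- %< ----
# the per-shard index objects are named .dir.<bucket_id>.<shard>
$ rados -p rgw.buckets.index ls | grep 6955f50e-5b23-4534-9b77-c7078f60f0d0.171713434.3

# dump the omap keys of one shard; an error here would point at the index itself
$ rados -p rgw.buckets.index listomapkeys \
    .dir.6955f50e-5b23-4534-9b77-c7078f60f0d0.171713434.3.0

# let RGW verify the index consistency for the affected bucket
$ radosgw-admin bucket check --tenant xy --uid xy --bucket xy
---- >% ----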
The affected buckets are comparatively large, around 4 - 7 TB,
but not all buckets of that size are affected.
Using "rados -p rgw.buckets.data ls" it seems like all the objects are
still there,
although "rados -p rgw.buckets.data get objectname -" only prints
unusable (?) binary data,
even for objects of intact buckets.
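As far as I understand, that binary output is expected: RGW stripes larger S3
objects across multiple rados objects and keeps its metadata in xattrs, so the
raw payload of a single rados object is generally not the complete, usable S3
object. To check whether individual objects are still reachable once the
(broken) listing is bypassed, something like this could be tried (the object
key is just a hypothetical example):
---- %< ----
# stat one known object through RGW, bypassing the bucket listing
$ radosgw-admin object stat --tenant xy --uid xy --bucket xy --object some/known/key

# or fetch it directly by key via S3
$ s3cmd get s3://xy/some/known/key ./some-known-key
---- >% ----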
Overall we're facing around 60 TB of customer data that is just gone at the
moment.
Is there a way to recover from this situation, or to further narrow down the
root cause of the problem?
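If more detailed logs would help, we can re-run the failing listing with
elevated debug levels and share the output, e.g. along these lines:
---- %< ----
# reproduce the failing listing with verbose RGW and messenger logging
$ radosgw-admin --debug-rgw=20 --debug-ms=1 bucket radoslist \
    --tenant xy --uid xy --bucket xy 2> radoslist-debug.log
---- >% ----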
Kind regards,
Lorenz