Upgraded to Quincy 17.2.7: some S3 buckets inaccessible

Hi everybody,

We upgraded our containerized Red Hat Pacific cluster to the latest Quincy release, 17.2.7 (Community Edition). The upgrade itself went fine; the cluster is HEALTH_OK and all daemons run the upgraded version:

---- %< ----
$ ceph -s
  cluster:
    id:     68675a58-cf09-4ebd-949c-b9fcc4f2264e
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum node02,node03,node04,node05,node01 (age 25h)
    mgr: node03.ztlair(active, since 25h), standbys: node01.koymku, node04.uvxgvp, node02.znqnhg, node05.iifmpc
    osd: 408 osds: 408 up (since 22h), 408 in (since 7d)
    rgw: 19 daemons active (19 hosts, 1 zones)

  data:
    pools:   11 pools, 8481 pgs
    objects: 236.99M objects, 544 TiB
    usage:   1.6 PiB used, 838 TiB / 2.4 PiB avail
    pgs:     8385 active+clean
             79   active+clean+scrubbing+deep
             17   active+clean+scrubbing

  io:
    client:   42 MiB/s rd, 439 MiB/s wr, 2.15k op/s rd, 1.64k op/s wr

---- >% ----

---- %< ----

$ ceph versions | jq .overall
{
"ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 437
}
---- >% ----

After all the daemons were upgraded, we started noticing some RGW buckets that are inaccessible.
s3cmd fails with NoSuchKey:

---- %< ----
$ s3cmd la -l
ERROR: S3 error: 404 (NoSuchKey)
---- >% ----

The buckets still exist according to "radosgw-admin bucket list".
Out of the ~600 buckets, 13 are inaccessible at the moment:

---- %< ----
$ radosgw-admin bucket radoslist --tenant xy --uid xy --bucket xy
2024-04-03T12:13:40.607+0200 7f0dbf4c4680 0 int RGWRados::cls_bucket_list_ordered(const DoutPrefixProvider*, RGWBucketInfo&, int, const rgw_obj_index_key&, const string&, const string&, uint32_t, bool, uint16_t, RGWRados::ent_map_t&, bool*, bool*, rgw_obj_index_key*, optional_yield, RGWBucketListNameFilter): CLSRGWIssueBucketList for xy:xy[6955f50e-5b23-4534-9b77-c7078f60f0d0.171713434.3]) failed
2024-04-03T12:13:40.609+0200 7f0dbf4c4680 0 int RGWRados::cls_bucket_list_ordered(const DoutPrefixProvider*, RGWBucketInfo&, int, const rgw_obj_index_key&, const string&, const string&, uint32_t, bool, uint16_t, RGWRados::ent_map_t&, bool*, bool*, rgw_obj_index_key*, optional_yield, RGWBucketListNameFilter): CLSRGWIssueBucketList for xy:xy[6955f50e-5b23-4534-9b77-c7078f60f0d0.171713434.3]) failed
---- >% ----

The affected buckets are comparatively large, around 4-7 TB each,
but not all buckets of that size are affected.

Using "rados -p rgw.buckets.data ls" it seems like all the objects are still there, although "rados -p rgw.buckets.data get objectname -" only prints unusable (?) binary data,
even for objects of intact buckets.
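
In case it helps with narrowing things down, a sketch of how a single S3 object can be mapped to the RADOS objects that store its data (tenant, bucket and object key are placeholders):

---- %< ----
# prints the object's metadata and manifest, i.e. which RADOS objects hold its data
$ radosgw-admin object stat --tenant xy --bucket xy --object <key>
---- >% ----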

Overall we're facing around 60 TB of customer data that is simply inaccessible at the moment. Is there a way to recover from this situation, or to further narrow down the root cause of the problem?

Kind regards,
Lorenz
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


