So it seems like a bucket still has objects listed in the bucket index, but the
underlying data objects are no longer there. Since you made reference to a
customer, I'm guessing the customer does not have direct access to the cluster
via `rados` commands, so there's no chance that they could have removed the
objects directly.

I would look for references to the head objects in the logs. So if you had
bucket "bkt1" and object "obj1", you could do the following:

1. Find the marker for the bucket:

       radosgw-admin metadata get bucket:bkt1

2. Construct the rados object name of the head object: <marker>_obj1

   You'll end up with something like
   "c44a7aab-e086-43df-befe-ed8151b3a209.4147.1_obj1".

3. grep through the logs for that head object name and see if you find
   anything.

Eric (he/him)
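Putting those steps together, a rough sketch (jq, the jq path, the pool name
"eu-central-1.rgw.buckets.data", and the log location are assumptions; adjust
them to your zone and setup):

    BUCKET=bkt1
    OBJ=obj1
    # 1. pull the bucket marker out of the bucket metadata
    MARKER=$(radosgw-admin metadata get bucket:${BUCKET} | jq -r '.data.bucket.marker')
    # 2. the head object should live in the data pool as <marker>_<object>;
    #    check whether it is still there (pool name is only an example)
    rados -p eu-central-1.rgw.buckets.data stat "${MARKER}_${OBJ}"
    # 3. grep the RGW logs for the head object name (path is only an example;
    #    rotated/compressed logs may need zgrep)
    grep -r "${MARKER}_${OBJ}" /var/log/ceph/

If the `rados stat` in step 2 already fails with ENOENT, that confirms the
head object is gone, and the grep in step 3 is where you might find what
removed it.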
> On Nov 22, 2022, at 10:36 AM, Boris Behrens <bb@xxxxxxxxx> wrote:
>
> Does someone have an idea what I can check, maybe what logs I can turn on,
> to find the cause of the problem? Or can I at least set up monitoring that
> tells me when this happens?
>
> Currently I go through ALL of the buckets and basically do a "compare
> bucket index to radoslist" for all objects in the bucket index. But I doubt
> this will give me new insights.
>
> On Mon, Nov 21, 2022 at 11:55 AM Boris Behrens <bb@xxxxxxxxx> wrote:
>
>> Good day people,
>>
>> we have a very strange problem with some buckets.
>> A customer informed us that they had issues with objects. The objects are
>> listed, but on a GET they receive a "NoSuchKey" error.
>> They did not delete anything from the bucket.
>>
>> We checked, and `radosgw-admin bucket radoslist --bucket $BUCKET` was
>> empty, but all the objects were still listed in `radosgw-admin bi list
>> --bucket $BUCKET`.
>>
>> On the date they noticed it, the cluster was as healthy as it can get in
>> our case. There were also no other tasks running, including orphan
>> object search, resharding of buckets, adding or removing OSDs,
>> rebalancing, and so on.
>>
>> Some data about the cluster:
>>
>> - 275 OSDs (38 SSD OSDs, 6 SSD OSDs reserved for GC, the rest 8-16 TB
>>   spinning HDDs) over 13 hosts
>> - one block.db SSD per 5 HDD OSDs
>> - The SSD OSDs are 100 GB LVs on our block.db SSDs and contain all the
>>   pools that are not rgw.buckets.data and rgw.buckets.non-ec
>> - The garbage collector is on separate SSD OSDs, which are also 100 GB
>>   LVs on our block.db SSDs
>> - We had to split the GC off from all other pools, because this bug
>>   (https://tracker.ceph.com/issues/53585) led to problems where we
>>   received 500 errors from RGW
>> - We have three HAProxy frontends, each pointing to one of our RGW
>>   instances (with the other two RGW daemons as fallback)
>> - We have 12 RGW daemons running in total, but only three of them are
>>   connected to the outside world (3x only for GC, 3x for some zonegroup
>>   restructuring, 3x for a dedicated customer with their own pools)
>> - We have multiple zonegroups with one zone each. We only replicate
>>   the metadata, so bucket names are unique and users get synced.
>>
>> Our ceph.conf:
>>
>> - I replaced IP addresses, the FSID, and domains
>> - the "-old" RGWs are meant to be replaced, because we have a naming
>>   conflict (all zonegroups are in one TLD and are separated by subdomain,
>>   but the initial RGW is still available via the TLD and not via
>>   subdomain.tld)
>>
>> [global]
>> fsid = $FSID
>> ms_bind_ipv6 = true
>> ms_bind_ipv4 = false
>> mon_initial_members = s3db1, s3db2, s3db3
>> mon_host = [$s3b1-IPv6-public_network],[$s3b2-IPv6-public_network],[$s3b3-IPv6-public_network]
>> auth_cluster_required = none
>> auth_service_required = none
>> auth_client_required = none
>> public_network = $public_network/64
>> #cluster_network = $cluster_network/64
>>
>> [mon.s3db1]
>> host = s3db1
>> mon addr = [$s3b1-IPv6-public_network]:6789
>>
>> [mon.s3db2]
>> host = s3db2
>> mon addr = [$s3b2-IPv6-public_network]:6789
>>
>> [mon.s3db3]
>> host = s3db3
>> mon addr = [$s3b3-IPv6-public_network]:6789
>>
>> [client]
>> rbd_cache = true
>> rbd_cache_size = 64M
>> rbd_cache_max_dirty = 48M
>> rgw_print_continue = true
>> rgw_enable_usage_log = true
>> rgw_resolve_cname = true
>> rgw_enable_apis = s3,admin,s3website
>> rgw_enable_static_website = true
>> rgw_trust_forwarded_https = true
>>
>> [client.gc-s3db1]
>> rgw_frontends = "beast endpoint=[::1]:7489"
>> #rgw_gc_processor_max_time = 1800
>> #rgw_gc_max_concurrent_io = 20
>>
>> [client.eu-central-1-s3db1]
>> rgw_frontends = beast endpoint=[::]:7482
>> rgw_region = eu
>> rgw_zone = eu-central-1
>> rgw_dns_name = name.example.com
>> rgw_dns_s3website_name = s3-website-name.example.com
>> rgw_thread_pool_size = 512
>>
>> [client.eu-central-1-s3db1-old]
>> rgw_frontends = beast endpoint=[::]:7480
>> rgw_region = eu
>> rgw_zone = eu-central-1
>> rgw_dns_name = example.com
>> rgw_dns_s3website_name = eu-central-1.example.com
>> rgw_thread_pool_size = 512
>
> --
> The "UTF-8 problems" self-help group will meet in the large hall this time,
> as an exception.
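As an aside on the "compare bucket index to radoslist" check Boris describes
above, here is a rough sketch of one way to list index entries that no longer
have a backing head object (jq and the `.entry.name` path are assumptions
based on the usual `bi list` output; versioned buckets and multipart entries
may need extra handling):

    BUCKET=bkt1
    MARKER=$(radosgw-admin metadata get bucket:${BUCKET} | jq -r '.data.bucket.marker')
    # object names according to the bucket index
    radosgw-admin bi list --bucket=${BUCKET} | jq -r '.[].entry.name' | sort -u > index.txt
    # object names that still have a head object in rados (strip the marker prefix)
    radosgw-admin bucket radoslist --bucket=${BUCKET} | sed -n "s/^${MARKER}_//p" | sort -u > rados.txt
    # entries that are listed in the index but have no head object behind them
    comm -23 index.txt rados.txt

Anything printed by the final `comm` is listed in the index but has no head
object, which matches the symptom of objects that show up in a listing yet
return "NoSuchKey" on GET.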