So it seems like a bucket still has objects listed in the bucket index, but the
underlying data objects are no longer there. Since you made reference to a
customer, I'm guessing the customer does not have direct access to the cluster
via `rados` commands, so there's no chance that they could have removed the
objects directly.

I would look for references to the head objects in the logs. So if you had
bucket "bkt1" and object "obj1", you could do the following:

1. Find the marker for the bucket:

       radosgw-admin metadata get bucket:bkt1

2. Construct the rados object name of the head object: <marker>_obj1

   You'll end up with something like
   "c44a7aab-e086-43df-befe-ed8151b3a209.4147.1_obj1".

3. grep through the logs for that head object name and see if you find
   anything.

Eric (he/him)
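Putting those steps together, a rough sketch (jq, the jq path, the pool name
"eu-central-1.rgw.buckets.data", and the log location are assumptions; adjust
them to your zone and setup):

    BUCKET=bkt1
    OBJ=obj1
    # 1. pull the bucket marker out of the bucket metadata
    MARKER=$(radosgw-admin metadata get bucket:${BUCKET} | jq -r '.data.bucket.marker')
    # 2. the head object should live in the data pool as <marker>_<object>;
    #    check whether it is still there (pool name is only an example)
    rados -p eu-central-1.rgw.buckets.data stat "${MARKER}_${OBJ}"
    # 3. grep the RGW logs for the head object name (path is only an example;
    #    rotated/compressed logs may need zgrep)
    grep -r "${MARKER}_${OBJ}" /var/log/ceph/

If the `rados stat` in step 2 already fails with ENOENT, that confirms the
head object is gone, and the grep in step 3 is where you might find what
removed it.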
> On Nov 22, 2022, at 10:36 AM, Boris Behrens <bb@xxxxxxxxx> wrote:
>
> Does someone have an idea what I can check, maybe what logs I can turn on,
> to find the cause of the problem? Or can I at least set up monitoring that
> tells me when this happens?
>
> Currently I go through ALL of the buckets and basically do a "compare
> bucket index to radoslist" for all objects in the bucket index. But I doubt
> this will give me new insights.
>
> On Mon, Nov 21, 2022 at 11:55 AM Boris Behrens <bb@xxxxxxxxx> wrote:
>
>> Good day people,
>>
>> we have a very strange problem with some buckets.
>> A customer informed us that they had issues with objects. The objects are
>> listed, but on a GET they receive a "NoSuchKey" error.
>> They did not delete anything from the bucket.
>>
>> We checked, and `radosgw-admin bucket radoslist --bucket $BUCKET` was
>> empty, but all the objects were still listed in `radosgw-admin bi list
>> --bucket $BUCKET`.
>>
>> On the date they noticed it, the cluster was as healthy as it can get in
>> our case. There were also no other tasks running, including orphan
>> object search, resharding of buckets, adding or removing OSDs,
>> rebalancing, and so on.
>>
>> Some data about the cluster:
>>
>> - 275 OSDs (38 SSD OSDs, 6 SSD OSDs reserved for GC, the rest 8-16 TB
>>   spinning HDDs) over 13 hosts
>> - one block.db SSD per 5 HDD OSDs
>> - The SSD OSDs are 100 GB LVs on our block.db SSDs and contain all the
>>   pools that are not rgw.buckets.data and rgw.buckets.non-ec
>> - The garbage collector is on separate SSD OSDs, which are also 100 GB
>>   LVs on our block.db SSDs
>> - We had to split the GC off from all other pools, because this bug
>>   (https://tracker.ceph.com/issues/53585) led to problems where we
>>   received 500 errors from RGW
>> - We have three HAProxy frontends, each pointing to one of our RGW
>>   instances (with the other two RGW daemons as fallback)
>> - We have 12 RGW daemons running in total, but only three of them are
>>   connected to the outside world (3x only for GC, 3x for some zonegroup
>>   restructuring, 3x for a dedicated customer with their own pools)
>> - We have multiple zonegroups with one zone each. We only replicate
>>   the metadata, so bucket names are unique and users get synced.
>>
>> Our ceph.conf:
>>
>> - I replaced IP addresses, the FSID, and domains
>> - the "-old" RGWs are meant to be replaced, because we have a naming
>>   conflict (all zonegroups are in one TLD and are separated by subdomain,
>>   but the initial RGW is still available via the TLD and not via
>>   subdomain.tld)
>>
>> [global]
>> fsid = $FSID
>> ms_bind_ipv6 = true
>> ms_bind_ipv4 = false
>> mon_initial_members = s3db1, s3db2, s3db3
>> mon_host = [$s3b1-IPv6-public_network],[$s3b2-IPv6-public_network],[$s3b3-IPv6-public_network]
>> auth_cluster_required = none
>> auth_service_required = none
>> auth_client_required = none
>> public_network = $public_network/64
>> #cluster_network = $cluster_network/64
>>
>> [mon.s3db1]
>> host = s3db1
>> mon addr = [$s3b1-IPv6-public_network]:6789
>>
>> [mon.s3db2]
>> host = s3db2
>> mon addr = [$s3b2-IPv6-public_network]:6789
>>
>> [mon.s3db3]
>> host = s3db3
>> mon addr = [$s3b3-IPv6-public_network]:6789
>>
>> [client]
>> rbd_cache = true
>> rbd_cache_size = 64M
>> rbd_cache_max_dirty = 48M
>> rgw_print_continue = true
>> rgw_enable_usage_log = true
>> rgw_resolve_cname = true
>> rgw_enable_apis = s3,admin,s3website
>> rgw_enable_static_website = true
>> rgw_trust_forwarded_https = true
>>
>> [client.gc-s3db1]
>> rgw_frontends = "beast endpoint=[::1]:7489"
>> #rgw_gc_processor_max_time = 1800
>> #rgw_gc_max_concurrent_io = 20
>>
>> [client.eu-central-1-s3db1]
>> rgw_frontends = beast endpoint=[::]:7482
>> rgw_region = eu
>> rgw_zone = eu-central-1
>> rgw_dns_name = name.example.com
>> rgw_dns_s3website_name = s3-website-name.example.com
>> rgw_thread_pool_size = 512
>>
>> [client.eu-central-1-s3db1-old]
>> rgw_frontends = beast endpoint=[::]:7480
>> rgw_region = eu
>> rgw_zone = eu-central-1
>> rgw_dns_name = example.com
>> rgw_dns_s3website_name = eu-central-1.example.com
>> rgw_thread_pool_size = 512
>
> --
> The "UTF-8 problems" self-help group will meet in the large hall this time,
> as an exception.
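As an aside on the "compare bucket index to radoslist" check Boris describes
above, here is a rough sketch of one way to list index entries that no longer
have a backing head object (jq and the `.entry.name` path are assumptions
based on the usual `bi list` output; versioned buckets and multipart entries
may need extra handling):

    BUCKET=bkt1
    MARKER=$(radosgw-admin metadata get bucket:${BUCKET} | jq -r '.data.bucket.marker')
    # object names according to the bucket index
    radosgw-admin bi list --bucket=${BUCKET} | jq -r '.[].entry.name' | sort -u > index.txt
    # object names that still have a head object in rados (strip the marker prefix)
    radosgw-admin bucket radoslist --bucket=${BUCKET} | sed -n "s/^${MARKER}_//p" | sort -u > rados.txt
    # entries that are listed in the index but have no head object behind them
    comm -23 index.txt rados.txt

Anything printed by the final `comm` is listed in the index but has no head
object, which matches the symptom of objects that show up in a listing yet
return "NoSuchKey" on GET.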