Re: radosgw-octopus latest - NoSuchKey Error - some buckets lose their rados objects, but not the bucket index

Does anyone have an idea what I can check, or which logs I can turn on,
to find the cause of this problem? Or at least how I can set up monitoring
that tells me when this happens?

Currently I am going through ALL of the buckets and basically doing a "compare
bucket index to radoslist" check for every object in the bucket index, but I
doubt this will give me new insights.
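
For reference, this is roughly the check I run. It is only a sketch and makes
a few assumptions: that `radosgw-admin bucket list` returns a plain JSON array
of bucket names, that `radosgw-admin bi list` returns a JSON array (so its
length is the number of index entries), and that `radosgw-admin bucket
radoslist` prints one rados object name per line. It simply warns whenever a
bucket still has index entries but radoslist comes back empty:

#!/bin/bash
# warn when a bucket has bucket-index entries but "bucket radoslist"
# returns nothing (sketch; assumes jq is installed)
for bucket in $(radosgw-admin bucket list | jq -r '.[]'); do
    idx=$(radosgw-admin bi list --bucket "$bucket" | jq 'length')
    rl=$(radosgw-admin bucket radoslist --bucket "$bucket" | wc -l)
    if [ "$idx" -gt 0 ] && [ "$rl" -eq 0 ]; then
        echo "WARNING: $bucket has $idx index entries but an empty radoslist"
    fi
done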

On Mon, 21 Nov 2022 at 11:55, Boris Behrens <bb@xxxxxxxxx> wrote:

> Good day people,
>
> we have a very strange problem with some buckets.
> A customer informed us that they have issues with objects: the objects are
> listed, but a GET on them returns a "NoSuchKey" error.
> They did not delete anything from the bucket.
>
> We checked, and `radosgw-admin bucket radoslist --bucket $BUCKET` was
> empty, but all the objects were still listed by `radosgw-admin bi list
> --bucket $BUCKET`.
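>
> To double-check whether the rados objects are really gone (and not just
> missing from the radoslist output), one can also look for the bucket marker
> directly in the data pool. This is only a sketch: it assumes the data pool
> is our rgw.buckets.data pool, that the head/shadow/multipart objects carry
> the bucket marker as a name prefix, and $MARKER below is just a placeholder
> for the marker printed by the first command.
>
> # get the marker/id of the bucket from its metadata
> radosgw-admin bucket stats --bucket $BUCKET | jq -r '.marker, .id'
>
> # list the rados objects in the data pool that start with that marker
> rados -p rgw.buckets.data ls | grep "^$MARKER"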
>
> On the date when they noticed the problem, the cluster was as healthy as it
> can get in our case. There were also no other tasks being performed, such as
> an orphan-objects search, resharding of buckets, adding or removing OSDs,
> rebalancing, and so on.
>
> Some data about the cluster:
>
>    - 275 OSDs (38 SSD OSDs, 6 SSD OSDs reserved for GC, the rest 8-16TB
>    spinning HDDs) across 13 hosts
>    - one SSD for block.db for every 5 HDD OSDs
>    - The SSD OSDs are 100GB LVs on our block.db SSDs and contain all the
>    pools that are not rgw.buckets.data and rgw.buckets.non-ec
>    - The garbage collection pool is on separate SSD OSDs, which are also
>    100GB LVs on our block.db SSDs
>    - We had to split the GC pool off from all other pools, because this bug
>    (https://tracker.ceph.com/issues/53585) led to problems where we received
>    500 errors from RGW (see the CRUSH sketch right after this list)
>    - We have three HAProxy frontends, each pointing to one of our RGW
>    instances, with the other two RGW daemons as fallback (a rough config
>    sketch follows below this list)
>    - We have 12 RGW daemons running in total, but only three of them are
>    connected to the outside world (3x only for GC, 3x for some zonegroup
>    restructuring, 3x for a dedicated customer with their own pools)
>    - We have multiple zonegroups with one zone each. We only replicate
>    the metadata, so bucket names are unique and users get synced.
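>
> For anyone wondering how the GC pool ends up on its own OSDs: the usual way
> to pin a single pool to dedicated OSDs is a custom device class plus its own
> CRUSH rule. This is only a rough sketch; the class name, rule name, OSD id
> and pool variable are placeholders (the actual GC pool of a zone is whatever
> `radosgw-admin zone get` reports as gc_pool):
>
> # give the dedicated OSDs their own device class
> ceph osd crush rm-device-class osd.270
> ceph osd crush set-device-class gc-ssd osd.270
>
> # create a replicated CRUSH rule that only selects that class
> ceph osd crush rule create-replicated gc-ssd-rule default host gc-ssd
>
> # move the GC pool onto the new rule
> ceph osd pool set $GC_POOL crush_rule gc-ssd-rule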
>
>
>
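> To illustrate the HAProxy part mentioned above: each frontend has one
> primary RGW instance, and the other two daemons are only used as "backup"
> servers. This is a simplified sketch with placeholder names, addresses and
> bind options, not our literal configuration:
>
> frontend rgw_eu_central_1
>     bind [::]:80 v4v6
>     default_backend be_rgw_eu_central_1
>
> backend be_rgw_eu_central_1
>     option httpchk GET /
>     server s3db1 [$RGW1_IP]:7482 check
>     server s3db2 [$RGW2_IP]:7482 check backup
>     server s3db3 [$RGW3_IP]:7482 check backup
>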
> Our ceph.conf:
>
>    - I replaced IP addresses, FSID, and domains
>    - the "-old" RGW instances are meant to be replaced, because we have a
>    naming conflict (all zonegroups live under one TLD and are separated by
>    subdomain, but the initial RGW is still reachable via the bare TLD and
>    not via subdomain.tld)
>
>
> [global]
> fsid                  = $FSID
> ms_bind_ipv6          = true
> ms_bind_ipv4          = false
> mon_initial_members   = s3db1, s3db2, s3db3
> mon_host              = [$s3b1-IPv6-public_network],[$s3b2-IPv6-public_network],[$s3b3-IPv6-public_network]
> auth_cluster_required = none
> auth_service_required = none
> auth_client_required  = none
> public_network        = $public_network/64
> #cluster_network       = $cluster_network/64
>
> [mon.s3db1]
> host = s3db1
> mon addr = [$s3b1-IPv6-public_network]:6789
>
> [mon.s3db2]
> host = s3db2
> mon addr = [$s3b2-IPv6-public_network]:6789
>
> [mon.s3db3]
> host = s3db3
> mon addr = [$s3b3-IPv6-public_network]:6789
>
> [client]
> rbd_cache = true
> rbd_cache_size = 64M
> rbd_cache_max_dirty = 48M
> rgw_print_continue = true
> rgw_enable_usage_log = true
> rgw_resolve_cname = true
> rgw_enable_apis = s3,admin,s3website
> rgw_enable_static_website = true
> rgw_trust_forwarded_https = true
>
> [client.gc-s3db1]
> rgw_frontends = "beast endpoint=[::1]:7489"
> #rgw_gc_processor_max_time = 1800
> #rgw_gc_max_concurrent_io = 20
>
> [client.eu-central-1-s3db1]
> rgw_frontends = beast endpoint=[::]:7482
> rgw_region = eu
> rgw_zone = eu-central-1
> rgw_dns_name = name.example.com
> rgw_dns_s3website_name = s3-website-name.example.com
> rgw_thread_pool_size = 512
>
> [client.eu-central-1-s3db1-old]
> rgw_frontends = beast endpoint=[::]:7480
> rgw_region = eu
> rgw_zone = eu-central-1
> rgw_dns_name = example.com
> rgw_dns_s3website_name = eu-central-1.example.com
> rgw_thread_pool_size = 512
>


-- 
As an exception, the self-help group "UTF-8 problems" will meet in the
large hall this time.