RadosGW still stuck on buckets

Hello!

Hope you all started the new year well...

New year, same problem: Still having the issue with the frozen radosgw buckets. Some information:

* Ceph 12.2.2 with bluestore

* 3 OSD nodes (64 GB RAM and 16 cores each), each housing 2 SSD OSDs for the bucket index and 4 OSDs for bucket data

* 10 Gbit/s cluster network, 4x1 Gbit/s public network

* Hardware seems fine, no errors relating to disks/SSDs in system logs or drive diagnostics.

In the meantime, I managed to migrate the whole application to another S3 provider, so I am now free to debug, restart services as I like, etc., as there are absolutely no clients accessing the cluster anymore.

I disabled automatic resharding to limit the damage, but all remaining buckets have by now entered this strange failed state where access simply results in timeouts.
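For reference, disabling dynamic resharding boils down to setting rgw_dynamic_resharding on the gateways and restarting them; a rough sketch (the ceph.conf section and daemon names below are just examples):

    # ceph.conf on the radosgw hosts (section name is an example)
    [client.rgw.gateway1]
    rgw_dynamic_resharding = false

    # restart the gateway afterwards so the option takes effect
    systemctl restart ceph-radosgw@rgw.gateway1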

If I try to access some buckets (all of these have been resharded in the past), the S3 API call (list objects) just times out. While that happens, the cluster sometimes reports 1-4 slow requests, entering HEALTH_WARN state until the request times out. There's also some SSD read activity on all OSDs carrying the bucket index pools, and some activity on some of the OSDs carrying the bucket data.
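To see where those slow requests sit while a listing hangs, something along these lines should show the stuck operations (osd.4 is just an example ID):

    # which OSDs are currently reporting slow requests
    ceph health detail

    # on the node hosting an affected OSD, dump the stuck/slow ops
    ceph daemon osd.4 dump_ops_in_flight
    ceph daemon osd.4 dump_historic_ops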

The timeout occurs after roughly 90 seconds. The bucket is supposed to have 822 objects:

            {
                "bucket": "xxx",
                "tenant": "",
                "num_objects": 822,
                "num_shards": 128,
                "objects_per_shard": 6,
                "fill_status": "OK"
            }

(When the bucket was still semi-accessible, I managed to delete most of its objects, hence the large number of shards.)
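Since all of the affected buckets were resharded at some point, my plan is to also check whether any reshard activity is still recorded for them, roughly like this (I'm not 100% sure the status subcommand is available in 12.2.2):

    # buckets currently queued for resharding
    radosgw-admin reshard list

    # per-shard reshard status of one of the stuck buckets
    radosgw-admin reshard status --bucket=xxx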

Before I go on, some (stupid?) questions to validate:

* Is automatic resharding supposed to work when multiple radosgw processes run behind a load balancer?

* Does automatic resharding play well with versioning-enabled buckets?

* Are bucket versioning and lifecycle rules considered stable, production-ready features?

* Is there no "down-sharding" I could try to shrink a bucket's shard count again? (See the command sketch below.)

How do I continue to debug this? Throughout all of this, I see absolutely no error messages in any log file (OSD, MON, MGR, or radosgw)... I would also think the hardware is beefy enough to list 822 objects in under 90 seconds, or am I missing something?
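My next idea is to turn up radosgw logging via the admin socket while reproducing one of the hanging listings, to see where the request stalls (the asok path depends on the actual client name; the one below is an example):

    # raise rgw and messenger logging on a running gateway
    ceph daemon /var/run/ceph/ceph-client.rgw.gateway1.asok config set debug_rgw 20
    ceph daemon /var/run/ceph/ceph-client.rgw.gateway1.asok config set debug_ms 1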

Thanks all :)

Martin




