RadosGW still stuck on buckets

Hello!

Hope you all started the new year well...

New year, same problem: Still having the issue with the frozen radosgw buckets. Some information:

* Ceph 12.2.2 with bluestore

* 3 OSD nodes (64 GB RAM and 16 cores each), each housing 2 SSD OSDs for the bucket index and 4 OSDs for bucket data

* 10 Gbit/s cluster network, 4x1 Gbit/s public network

* Hardware seems fine, no errors relating to disks/SSDs in system logs or drive diagnostics.

In the meantime, I managed to migrate the whole application to another S3 provider, so I am now free to debug, restart services as I like, etc., as there are absolutely no clients accessing the cluster anymore.

I disabled automatic resharding to limit the damage, but all remaining buckets have by now entered this strange failed state where access simply results in timeouts.
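For reference, disabling dynamic resharding boils down to setting rgw_dynamic_resharding on the gateways and restarting them; a rough sketch (the ceph.conf section and daemon names below are just examples):

    # ceph.conf on the radosgw hosts (section name is an example)
    [client.rgw.gateway1]
    rgw_dynamic_resharding = false

    # restart the gateway afterwards so the option takes effect
    systemctl restart ceph-radosgw@rgw.gateway1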

If I try to access some buckets (all of these have been resharded in the past), the S3 API call (list objects) just times out. While that happens, the cluster sometimes reports 1-4 slow requests, entering HEALTH_WARN state until the request times out. There's also some SSD read activity on all OSDs carrying the bucket index pools, and some activity on some of the OSDs carrying the bucket data.
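To see where those slow requests sit while a listing hangs, something along these lines should show the stuck operations (osd.4 is just an example ID):

    # which OSDs are currently reporting slow requests
    ceph health detail

    # on the node hosting an affected OSD, dump the stuck/slow ops
    ceph daemon osd.4 dump_ops_in_flight
    ceph daemon osd.4 dump_historic_ops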

The timeout occurs after roughly 90 seconds. The bucket is supposed to have 822 objects:

            {
                "bucket": "xxx",
                "tenant": "",
                "num_objects": 822,
                "num_shards": 128,
                "objects_per_shard": 6,
                "fill_status": "OK"
            }

(When the bucket was still semi-accessible, I managed to delete most of its objects, hence the large number of shards.)
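Since all of the affected buckets were resharded at some point, my plan is to also check whether any reshard activity is still recorded for them, roughly like this (I'm not 100% sure the status subcommand is available in 12.2.2):

    # buckets currently queued for resharding
    radosgw-admin reshard list

    # per-shard reshard status of one of the stuck buckets
    radosgw-admin reshard status --bucket=xxx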

Before I go on, some (stupid?) questions to validate:

* Is automatic resharding supposed to work when multiple radosgw processes run behind a load balancer?

* Does automatic resharding play well with versioning-enabled buckets?

* Are bucket versioning and lifecycle rules considered stable, production-ready features?

* Is there no "down-sharding" I could try to shrink a bucket's shard count again? (See the command sketch below.)

How do I continue to debug this? Throughout all of this, I see absolutely no error messages in any log file (OSD, MON, MGR, or radosgw)... I would also think the hardware is beefy enough to list 822 objects in under 90 seconds, or am I missing something?
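My next idea is to turn up radosgw logging via the admin socket while reproducing one of the hanging listings, to see where the request stalls (the asok path depends on the actual client name; the one below is an example):

    # raise rgw and messenger logging on a running gateway
    ceph daemon /var/run/ceph/ceph-client.rgw.gateway1.asok config set debug_rgw 20
    ceph daemon /var/run/ceph/ceph-client.rgw.gateway1.asok config set debug_ms 1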

Thanks all :)

Martin




