Hello!
Hope you all started the new year well...
New year, same problem: Still having the issue with the frozen radosgw
buckets. Some information:
* Ceph 12.2.2 with BlueStore
* 3 OSD nodes, each housing 2 SSD OSDs for the bucket index and 4 OSDs
for bucket data; each node has 64 GB RAM and 16 cores
* 10 GbE cluster network, 4x 1 GbE public network
* Hardware seems fine; no errors relating to disks/SSDs in the system
logs or drive diagnostics.
In the meantime, I managed to migrate the whole application to another
S3 provider, so I am now free to debug, restart services, etc. as I
like; there are absolutely no clients accessing the cluster anymore.
I disabled automatic resharding to limit the damage (config snippet
below), but all remaining buckets have already entered this strange
failed state where access results in timeouts.
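For completeness, the knob in question; the section name below is a
stand-in for my actual rgw instance name:

    # ceph.conf on every radosgw node, followed by a radosgw restart
    [client.rgw.gateway1]
        # disable dynamic bucket index resharding (Luminous default: true)
        rgw dynamic resharding = false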
If I try to access some buckets (all of these have been resharded in
the past), the S3 API call (list objects) just times out. While that
happens, the cluster sometimes reports 1-4 slow requests and enters
HEALTH_WARN until the request times out. There's also some SSD read
activity on all OSDs carrying the bucket index pool, and some activity
on some of the OSDs carrying the bucket data.
The timeout occurs after ca. 90s. The bucket is supposed to have 822
objects:
{
    "bucket": "xxx",
    "tenant": "",
    "num_objects": 822,
    "num_shards": 128,
    "objects_per_shard": 6,
    "fill_status": "OK"
}
(While the bucket was still semi-accessible, I managed to delete most
of its objects, hence the big shard count for so few remaining objects.)
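Would something along these lines be a sane way to poke at the
individual index shards directly? The pool name assumes the default
zone layout, and BUCKET_ID is a placeholder for the bucket instance id
from "radosgw-admin bucket stats --bucket=xxx":

    # count the omap keys on each of the 128 index shard objects
    BUCKET_ID=...   # placeholder: instance id from "bucket stats"
    for shard in $(seq 0 127); do
        printf 'shard %s: ' "$shard"
        rados -p default.rgw.buckets.index \
            listomapkeys ".dir.${BUCKET_ID}.${shard}" | wc -l
    done

If a single shard hangs here as well, that would at least point me at a
specific index object and its primary OSD.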
Before I go on, some (stupid?) questions to validate:
* Automatic resharding is supposed to work when I have multiple radosgw
processes behind a load balancer?
* Automatic resharding should play well with versioning-enabled buckets?
* Versioning in buckets and lifecycle rules are considered stable
features ready for production?
* There is no "down-sharding" for shrinking buckets I could try? (See
the command sketch right below for what I mean.)
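If manual resharding is the intended tool for this, the following is
what I would try; I am not sure whether radosgw-admin even accepts a
target shard count lower than the current one, hence the question:

    # manual reshard attempt; 16 is just an arbitrary target I picked
    radosgw-admin bucket reshard --bucket=xxx --num-shards=16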
How do I continue to debug this? During all of this, I see absolutely
no error messages in any log file (OSD, MON, MGR, or radosgw)... I also
think the hardware is beefy enough to list 822 objects in well under
90s, or did I miss something? My rough plan is sketched below.
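Concretely, I intend to crank up radosgw logging and watch a single
stuck listing; the section and socket names below are from my setup:

    # ceph.conf on one gateway, followed by a radosgw restart
    [client.rgw.gateway1]
        debug rgw = 20
        debug ms = 1

    # or at runtime via the admin socket on that gateway
    ceph daemon /var/run/ceph/ceph-client.rgw.gateway1.asok \
        config set debug_rgw 20

    # and, while a listing hangs, see what the index OSDs are chewing on
    ceph daemon osd.0 dump_ops_in_flight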
Thanks all :)
Martin