>>> - 20 spinning SAS disks per node.
>> Don't use legacy HDDs if you care about performance.
>
> You are right here, but we use Ceph mainly for RBD. It performs 'good enough' for our RBD load.

You use RBD for archival?

>>> - Some nodes have 256GB RAM, some nodes 128GB.
>> 128GB is on the low side for 20 OSDs.
>
> Agreed, but with 20 OSDs x osd_memory_target of 4GB (80GB) it is enough. We haven't had any server OOM yet.

Remember that's a *target*, not a *limit*. Say one or more of your failure domains goes offline, or you have some other large topology change: your OSDs might want up to 2x osd_memory_target, then you OOM and it cascades. I've been there, and had to do an emergency upgrade of 24-OSD nodes from 128GB to 192GB.

>>> - CPU varies between Intel E5-2650 and Intel Gold 5317.
>> The E5-2650 is underpowered for 20 OSDs. The 5317 isn't the ideal fit either; it'd make a decent MDS system, but assuming a dual-socket system you have ~2 threads per OSD, which is maybe acceptable for HDDs, though I assume you have mon/mgr/rgw on some of them too.
>
> The (CPU) load on the OSD nodes is quite low. Our MON/MGR/RGW aren't hosted on the OSD nodes and are running on modern hardware.

You didn't list additional nodes, so I assumed. You might still do well to have a larger number of RGWs, wherever they run; RGWs often scale better horizontally than vertically.

>
>> rados bench is useful for smoke testing but not always a reflection of the E2E experience.
>>> Unfortunately I am not getting the same performance with the Rados Gateway (S3).
>>> - 1x HAProxy with 3 backend RGWs.
>> Run an RGW on every node.
>
> On every OSD node?

Yep, why not?

>>> I am using MinIO Warp for benchmarking (PUT), with 1 Warp server and 5 Warp clients, benchmarking against the HAProxy.
>>> Results:
>>> - Using 10MB object size, I am hitting the 10Gbit/s link of the HAProxy server. That's good.
>>> - Using 500K object size, I am getting a throughput of 70 up to 150 MB/s with 140 up to 300 obj/s.
>> Tiny objects are the devil of any object storage deployment. The HDDs are killing you here, especially for the index pool. You might get a bit better results by upping pg_num beyond the party line.
>
> I would expect high write await times, but all OSDs/disks have write await times of 1 up to 3 ms.

There are still serializations in the OSD and PG code. You have 240 OSDs; does your index pool have *at least* 256 PGs?

>
>> You might also disable Nagle on the RGW nodes.
>
> I need to look up what exactly that is and does.
>
>>> It depends on the concurrency setting of Warp.
>>> It looks like the objects/s is the bottleneck, not the throughput.
>>> Max memory usage is about 80-90GB per node. The CPUs are mostly idle.
>>> Is it reasonable to expect more IOPS / objects/s for RGW with my setup? At the moment I am not able to find the bottleneck that is causing the low obj/s.
>> HDDs are a false economy.
>
> Got it :)
>
>>> Ceph version is 15.2.
>>> Thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
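
To put rough numbers on the osd_memory_target point above: 4GB is a steady-state target, not a cap, so worst-case planning should allow for roughly 2x per OSD during recovery/backfill plus OS overhead. A minimal back-of-the-envelope sketch in Python; the 2x factor and the 16 GiB OS reserve are assumptions for illustration, not Ceph-enforced values:

    # Back-of-the-envelope RAM headroom check for a 20-OSD node.
    # Assumptions (not Ceph defaults): OSDs may balloon to ~2x their
    # memory target during large recovery/backfill events, and the OS,
    # page cache, etc. need ~16 GiB of their own.

    GiB = 1024 ** 3

    osds_per_node     = 20
    osd_memory_target = 4 * GiB   # per-OSD steady-state target
    recovery_factor   = 2         # assumed worst-case multiplier
    os_reserve        = 16 * GiB  # assumed headroom for the OS itself

    steady_state = osds_per_node * osd_memory_target
    worst_case   = osds_per_node * osd_memory_target * recovery_factor + os_reserve

    print(f"steady state: {steady_state / GiB:.0f} GiB")  # 80 GiB, fits in 128GB
    print(f"worst case:   {worst_case / GiB:.0f} GiB")    # 176 GiB, a 128GB node can OOM

Under those assumptions the 128GB nodes are fine day to day but short on headroom for a bad day, which matches the OOM cascade described above.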
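
On the index-pool PG question: the usual rule of thumb is (number of OSDs x target PGs per OSD x the pool's expected share of data) / replica size, rounded up to a power of two. A small sketch of that calculation; the target-per-OSD and pool-share values are assumed example inputs to illustrate the formula, not tuned recommendations for this cluster:

    import math

    # Rule-of-thumb PG count for a single pool:
    #   pg_num ~= num_osds * target_pgs_per_osd * pool_share / replica_size,
    # rounded up to the next power of two.

    def suggest_pg_num(num_osds, target_pgs_per_osd, pool_share, replica_size):
        raw = num_osds * target_pgs_per_osd * pool_share / replica_size
        return 2 ** math.ceil(math.log2(raw))

    # Assumed example: 240 OSDs, ~100 PGs per OSD across all pools, a 3x
    # replicated RGW index pool carrying ~5% of the cluster's PG budget.
    print(suggest_pg_num(num_osds=240, target_pgs_per_osd=100,
                         pool_share=0.05, replica_size=3))  # -> 512

With those assumed inputs the formula lands well above 256, which is why the "at least 256 PGs" question for a 240-OSD cluster is worth checking.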
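
On the Nagle question above: Nagle's algorithm batches small TCP writes into fewer, larger segments, which helps throughput but adds latency for many small requests, and "disabling Nagle" means setting TCP_NODELAY on the sockets involved. A minimal Python sketch of what that option does at the socket level, purely illustrative and not RGW or HAProxy configuration:

    import socket

    # Nagle's algorithm (on by default) delays small writes so they can be
    # coalesced into larger TCP segments; good for bulk throughput, bad for
    # the latency of many small requests. TCP_NODELAY turns that batching off.

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    print("TCP_NODELAY:", sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
    sock.close()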