Re: Performance issues RGW (S3)

On 2024-06-10 15:20, Anthony D'Atri wrote:
Hi all,

My Ceph setup:
- 12 OSD nodes, 4 OSD nodes per rack. Replication of 3, 1 replica per rack.
- 20 spinning SAS disks per node.

Don't use legacy HDDs if you care about performance.

You are right here, but we use Ceph mainly for RBD. It performs 'good enough' for our RBD load.
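(For reference, the rack-level replication described in the setup above corresponds to a CRUSH rule along these lines; the rule name and <pool> are only placeholders, so treat this as a minimal sketch rather than the exact rules in use:)

    ceph osd crush rule create-replicated rack-replicated default rack
    ceph osd pool set <pool> crush_rule rack-replicated
    ceph osd pool set <pool> size 3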

- Some nodes have 256GB RAM, some nodes 128GB.

128GB is on the low side for 20 OSDs.

Agreed, but with 20 OSDs x an osd_memory_target of 4GB (80GB) it is enough. We haven't had a server OOM yet.
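(A quick way to confirm the per-OSD memory target assumed in that math; the 4 GiB value and osd.0 are just the example from above:)

    ceph config get osd osd_memory_target            # cluster-wide default
    ceph config show osd.0 osd_memory_target         # effective value on one OSD
    ceph config set osd osd_memory_target 4294967296 # pin it to 4 GiB if desired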

- CPU varies between Intel E5-2650 and Intel Gold 5317.

The E5-2650 is underpowered for 20 OSDs. The 5317 isn't an ideal fit either; it would make a decent MDS system, but assuming a dual-socket system you have ~2 threads per OSD, which may be acceptable for HDDs, and I assume you have mon/mgr/rgw on some of those nodes too.

The (CPU) load on the OSD nodes is quite low. Our MON/MGR/RGW aren't hosted on the OSD nodes and are running on modern hardware.

- Each node has 10Gbit/s network.

Using rados bench I am getting decent results (depending on block size):
- 1000 MB/s throughput, 1000 IOPS with 1MB block size
- 30 MB/s throughput, 7500 IOPS with 4K block size

rados bench is useful for smoke testing, but not always a reflection of the end-to-end experience.
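(For completeness, roughly how such a rados bench run looks; the pool name and thread count below are only examples:)

    rados bench -p testpool 60 write -b 4096 -t 16 --no-cleanup
    rados bench -p testpool 60 seq -t 16
    rados -p testpool cleanup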


Unfortunately I am not getting the same performance with the RADOS Gateway (S3).

- 1x HAProxy with 3 backend RGWs.

Run an RGW on every node.

On every OSD node?
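(If the idea is simply more RGW daemons spread across more hosts behind the existing HAProxy, the backend section would grow roughly like this; the host addresses and port are made up for the sketch:)

    backend rgw
        balance leastconn
        server rgw1 10.0.0.11:8080 check
        server rgw2 10.0.0.12:8080 check
        server rgw3 10.0.0.13:8080 check
        # ...one line per additional RGW host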


I am using Minio Warp for benchmarking (PUT), with 1 Warp server and 5 Warp clients, benchmarking against the HAProxy.

Results:
- Using 10MB object size, I am hitting the 10Gbit/s link of the HAProxy server. That's good.
- Using 500K object size, I am getting a throughput of 70 to 150 MB/s with 140 to 300 obj/s.

Tiny objects are the devil of any object storage deployment. The HDDs are killing you here, especially for the index pool. You might do a bit better by raising pg_num above the usual party-line value.

I would expect high write await times, but all OSDs/disks show write await times of only 1 to 3 ms.
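(On the pg_num suggestion: with the default pool names it would be something like the following; 128 is only an example target, so check total PGs per OSD and the autoscaler first. On 15.2, pgp_num should follow automatically.)

    ceph osd pool get default.rgw.buckets.index pg_num
    ceph osd pool set default.rgw.buckets.index pg_num 128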


You might also disable Nagle on the RGW nodes.

I need to look up what exactly that is and does.
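(Short version: Nagle's algorithm batches small TCP writes, which adds latency on small-request workloads; disabling it means setting TCP_NODELAY on the RGW sockets. With the beast/civetweb frontends there is a tcp_nodelay frontend option, roughly as below, though option availability and the section name depend on your release and deployment, so treat this as a sketch:)

    [client.rgw.rgw1]
        rgw_frontends = beast port=8080 tcp_nodelay=1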


It depends on the concurrency setting of Warp.

It looks like objects/s is the bottleneck, not throughput.
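(Concurrency in Warp is the --concurrent flag, per client; with small objects, raising it is usually what pushes obj/s up until latency becomes the limit. A sketch of such a distributed run, with host names and credentials made up:)

    warp put --warp-client=warp-client{1...5}:7761 \
        --host=haproxy.example.com:80 \
        --access-key=KEY --secret-key=SECRET \
        --obj.size=500KiB --concurrent=64 --duration=5m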

Max memory usage is about 80-90GB per node. CPUs are mostly idle.

Is it reasonable to expect more IOPS / obj/s for RGW with my setup? At the moment I am not able to find the bottleneck that is causing the low obj/s.

HDDs are a false economy.

Got it :)


Ceph version is 15.2.

Thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


