On 2024-06-10 17:42, Anthony D'Atri wrote:
- 20 spinning SAS disks per node.
Don't use legacy HDDs if you care about performance.
You are right here, but we use Ceph mainly for RBD. It performs 'good
enough' for our RBD load.
You use RBD for archival?
No, storage for (light-weight) virtual machines.
- Some nodes have 256GB RAM, some nodes 128GB.
128GB is on the low side for 20 OSDs.
Agreed, but with 20 OSDs x an osd_memory_target of 4GB (80GB) it is enough.
We haven't had any server that OOM'ed yet.
Remember that's a *target* not a *limit*. Say one or more of your
failure domains goes offline or you have some other large topology
change. Your OSDs might want up to 2x osd_memory_target, then you OOM
and it cascades. I've been there, had to do an emergency upgrade of
24xOSD nodes from 128GB to 192GB.
+1
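For anyone doing this math: a quick sanity check is to compare the per-OSD target against RAM per node divided by OSD count, leaving roughly 2x headroom for recovery plus the OS. A sketch, assuming the option lives in the mon config database rather than ceph.conf:

    # show the current per-OSD target (default is 4 GiB)
    ceph config get osd osd_memory_target
    # example: drop the target to 3 GiB on the 128GB / 20-OSD nodes
    ceph config set osd osd_memory_target 3221225472

Whether you lower the target or add RAM is a judgment call; lowering it trades cache hit rate for headroom.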
- CPU varies between Intel E5-2650 and Intel Gold 5317.
The E5-2650 is underpowered for 20 OSDs. The 5317 isn't an ideal fit either; it'd make a decent MDS system, but assuming a dual-socket system you have only ~2 threads per OSD, which is maybe acceptable for HDDs, but I assume you have mon/mgr/rgw on some of them too.
The (CPU) load on the OSD nodes is quite low. Our MON/MGR/RGW aren't
hosted on the OSD nodes and are running on modern hardware.
You didn't list additional nodes so I assumed. You might still do well
to have a larger number of RGWs, wherever they run. RGWs often scale
better horizontally than vertically.
Good to know. I'll check if adding more RGW nodes is possible.
rados bench is useful for smoke testing but not always a reflection
of the end-to-end (E2E) experience.
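For reference, the kind of smoke test I mean looks roughly like this; the pool name is a guess at your layout, adjust to taste:

    # 60s of 4MB writes with 16 concurrent ops, keeping the objects for a read pass
    rados bench -p default.rgw.buckets.data 60 write -b 4194304 -t 16 --no-cleanup
    rados bench -p default.rgw.buckets.data 60 seq -t 16
    rados -p default.rgw.buckets.data cleanup

It exercises RADOS only, so it skips the RGW HTTP and index path entirely.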
Unfortunately I am not getting the same performance with the RADOS
Gateway (S3).
- 1x HAProxy with 3 backend RGWs.
Run an RGW on every node.
On every OSD node?
Yep, why not?
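If you do, wiring them in is mostly just more entries in the HAProxy backend. A rough sketch, with hostnames and port invented for illustration:

    backend rgw
        balance leastconn
        server rgw-osd01 osd01.example.com:8080 check
        server rgw-osd02 osd02.example.com:8080 check
        # ...one line per additional RGW instance

leastconn tends to behave better than roundrobin when object sizes vary.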
I am using MinIO Warp for benchmarking (PUT), with 1 Warp server and
5 Warp clients, benchmarking against the HAProxy.
Results:
- Using a 10MB object size, I am saturating the 10Gbit/s link of the
HAProxy server. That's good.
- Using a 500K object size, I am getting a throughput of 70 to 150
MB/s at 140 to 300 obj/s.
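For completeness, the runs look roughly like this; the hostnames, keys and the concurrency value below are placeholders rather than the exact invocation:

    # on each of the 5 load generators
    warp client
    # on the coordinating host, pointed at the HAProxy address
    warp put --warp-client warp-client{1...5}:7761 \
        --host haproxy.example.com:80 \
        --access-key "$ACCESS_KEY" --secret-key "$SECRET_KEY" \
        --obj.size 500KiB --duration 5m --concurrent 32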
Tiny objects are the devil of any object storage deployment. The
HDDs are killing you here, especially for the index pool. You might
do a bit better by upping pg_num above the party-line default.
I would expect high write await times, but all OSDs/disks have write
await times of only 1 to 3 ms.
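For what it's worth, that was measured roughly like this on the OSD nodes while the benchmark runs, watching the w_await column:

    iostat -x 5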
There are still serializations in the OSD and PG code. You have 240
OSDs, does your index pool have *at least* 256 PGs?
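A quick way to check, and to bump it if needed; the pool name below assumes the default zone naming:

    ceph osd pool get default.rgw.buckets.index pg_num
    # e.g. raise it to 512 (pgp_num follows automatically on recent releases)
    ceph osd pool set default.rgw.buckets.index pg_num 512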
The index pool, like the data pool, has 256 PGs.
You might also disable Nagle on the RGW nodes.
I need to look up what that exactly is and does.
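Short version: Nagle's algorithm batches small TCP writes, which adds latency exactly where small objects hurt most. If your release's beast frontend supports the tcp_nodelay option, it can be switched off per RGW with something like the following (the port is a placeholder; the same setting can also live in ceph.conf):

    ceph config set client.rgw rgw_frontends "beast port=8080 tcp_nodelay=1"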
It depends on the concurrency setting of Warp.
It looks like objects/s is the bottleneck, not throughput.
Max memory usage is about 80-90GB per node. The CPUs are mostly idle.
Is it reasonable to expect more IOPS / objects/s for RGW with my
setup? At this moment I am not able to find the bottleneck that is
causing the low obj/s.
HDDs are a false economy.
Got it :)
Ceph version is 15.2.
Thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx