Re: Performance issues RGW (S3)


 



On 2024-06-10 17:42, Anthony D'Atri wrote:
- 20 spinning SAS disks per node.
Don't use legacy HDDs if you care about performance.

You are right here, but we use Ceph mainly for RBD. It performs 'good enough' for our RBD load.

You use RBD for archival?

No, storage for (light-weight) virtual machines.


- Some nodes have 256GB RAM, some nodes 128GB.
128GB is on the low side for 20 OSDs.

Agreed, but with 20 OSDs x an osd_memory_target of 4GB (80GB total) it is enough. We haven't had any server OOM yet.

Remember that's a *target*, not a *limit*. Say one or more of your failure domains goes offline, or you have some other large topology change: your OSDs might want up to 2x osd_memory_target, then you OOM and it cascades. I've been there; I had to do an emergency upgrade of 24-OSD nodes from 128GB to 192GB.

+1
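For what it's worth, the target can be checked and, if needed, lowered cluster-wide to leave more headroom for recovery spikes. A minimal sketch using the config database; the 3GiB value is only illustrative:

  # current per-OSD memory target
  ceph config get osd osd_memory_target
  # lower it to ~3GiB to leave headroom for backfill/recovery spikes
  ceph config set osd osd_memory_target 3221225472

That trades some BlueStore cache (and thus some performance) for safety margin during large topology changes.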

- CPU varies between Intel E5-2650 and Intel Gold 5317.
E5-2650 is underpowered for 20 OSDs. The 5317 isn't the ideal fit either; it'd make a decent MDS system, but assuming a dual-socket system you have ~2 threads per OSD, which is maybe acceptable for HDDs, but I assume you have mon/mgr/rgw on some of them too.

The (CPU) load on the OSD nodes is quite low. Our MON/MGR/RGW aren't hosted on the OSD nodes and are running on modern hardware.

You didn't list additional nodes, so I assumed they were colocated. You might still do well to have a larger number of RGWs, wherever they run. RGWs often scale better horizontally than vertically.

Good to know. I'll check if adding more RGW nodes is possible.
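If the cluster is deployed with cephadm, adding RGW daemons is mostly a placement change. A rough sketch with illustrative realm/zone/host names; the exact 'ceph orch apply rgw' arguments vary between releases:

  # run six RGWs spread over the listed hosts (all names are examples)
  ceph orch apply rgw myrealm myzone --placement="6 host1 host2 host3 host4 host5 host6"

HAProxy would then need the new daemons added as backends.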


rados bench is useful for smoke testing but not always a reflection of the E2E experience.
Unfortunately not getting the same performance with Rados Gateway (S3).
- 1x HAProxy with 3 backend RGW's.
Run an RGW on every node.

On every OSD node?

Yep, why not?


I am using MinIO Warp for benchmarking (PUT), with 1 Warp server and 5 Warp clients, benchmarking against the HAProxy (rough invocation sketched below the results).
Results:
- Using 10MB object size, I am hitting the 10Gbit/s link of the HAProxy server. That's good.
- Using 500K object size, I am getting a throughput of 70 up to 150 MB/s with 140 up to 300 obj/s.
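For context, a run like the 500K one is driven roughly as follows; this is a sketch, the endpoint, credentials, and concurrency are illustrative, and the exact warp flags may differ by version:

  # on each of the 5 client machines
  warp client
  # on the Warp server, driving the clients against HAProxy
  warp put --warp-client client1,client2,client3,client4,client5 \
      --host haproxy.internal:80 --access-key KEY --secret-key SECRET \
      --obj.size 500KiB --concurrent 32 --duration 5m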
Tiny objects are the devil of any object storage deployment. The HDDs are killing you here, especially for the index pool. You might do a bit better by upping pg_num beyond the party-line value.

I would expect high write await times, but all OSD/disks have write await times of 1 ms up to 3 ms.

There are still serializations in the OSD and PG code. You have 240 OSDs, does your index pool have *at least* 256 PGs?

The index pool, like the data pool, has 256 PGs.
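For reference, checking and raising it is straightforward; a sketch assuming the default zone pool name (adjust for multisite), with 512 only as an example:

  # current PG count of the bucket index pool
  ceph osd pool get default.rgw.buckets.index pg_num
  # raise it to spread index omap load over more OSDs
  ceph osd pool set default.rgw.buckets.index pg_num 512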



You might also disable Nagle on the RGW nodes.

I need to look up what exactly that is and does.
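Short version: Nagle's algorithm holds back small TCP segments to coalesce them, which adds latency on small-request workloads; TCP_NODELAY disables it. For RGW's beast frontend there is, as far as I recall, a tcp_nodelay option. A sketch (the port is illustrative, and the RGWs need a restart afterwards):

  ceph config set client.rgw rgw_frontends "beast port=8080 tcp_nodelay=1"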

It depends on the concurrency setting of Warp.
It looks like objects/s is the bottleneck, not throughput.
Max memory usage is about 80-90GB per node. The CPUs are mostly idle.
Is it reasonable to expect more IOPS / objects/s for RGW with my setup? At this moment I am not able to find the bottleneck that is causing the low obj/s.
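(One quick check: per-OSD latency as the OSDs themselves track it, which can expose index-pool omap hotspots that raw disk await hides; osd.12 below is only an example, and the daemon command must be run on that OSD's host.)

  ceph osd perf                              # commit/apply latency per OSD, in ms
  ceph daemon osd.12 dump_ops_in_flight      # in-flight ops on a suspect OSD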
HDDs are a false economy.

Got it :)

Ceph version is 15.2.
Thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx