Re: Performance issues RGW (S3)

>>> - 20 spinning SAS disks per node.
>> Don't use legacy HDDs if you care about performance.
> 
> You are right here, but we use Ceph mainly for RBD. It performs 'good enough' for our RBD load.

You use RBD for archival?


>>> - Some nodes have 256GB RAM, some nodes 128GB.
>> 128GB is on the low side for 20 OSDs.
> 
> Agreed, but with 20 OSDs x osd_memory_target 4GB (80GB) it is enough. We haven't had a server OOM yet.

Remember, that's a *target*, not a *limit*.  Say one or more of your failure domains goes offline, or you have some other large topology change: your OSDs might want up to 2x osd_memory_target, at which point you OOM and it cascades.  I've been there; I had to do an emergency upgrade of 24-OSD nodes from 128GB to 192GB.
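
A rough back-of-the-envelope, assuming the 4GB target you quote and the worst case above (illustrative numbers, not measurements):

    20 OSDs x 4 GiB osd_memory_target         =  80 GiB steady state
    20 OSDs x ~8 GiB during a large recovery  = ~160 GiB transient
    plus OS, page cache and agents            -> not much headroom in 128 GiB

You can confirm what the cluster thinks the target is with:

    ceph config get osd osd_memory_target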

>>> - CPU varies between Intel E5-2650 and Intel Gold 5317.
>> The E5-2650 is underpowered for 20 OSDs.  The 5317 isn't an ideal fit either; it'd make a decent MDS system, but assuming a dual-socket system you have ~2 threads per OSD, which is maybe acceptable for HDDs, but I assume you have mon/mgr/rgw on some of them too.
> 
> The (CPU) load on the OSD nodes is quite low. Our MON/MGR/RGW aren't hosted on the OSD nodes and are running on modern hardware.

You didn't list additional nodes, so I assumed they were colocated.  You might still do well to run a larger number of RGWs, wherever they live: RGWs often scale better horizontally than vertically.

> 
>> rados bench is useful for smoke testing, but not always a reflection of the end-to-end experience.
>>> Unfortunately not getting the same performance with Rados Gateway (S3).
>>> - 1x HAProxy with 3 backend RGW's.
>> Run an RGW on every node.
> 
> On every OSD node?

Yep, why not?
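
As a sketch, assuming a classic ceph.conf-style deployment and the default zone (the hostnames, section names and ports below are made up), each OSD node gets its own RGW instance and a matching line in the HAProxy backend:

    # ceph.conf on each OSD node (one section per node)
    [client.rgw.node01]
        host = node01
        rgw_frontends = "beast port=7480"

    # haproxy.cfg: one server line per node
    backend rgw
        balance leastconn
        server node01 node01:7480 check
        server node02 node02:7480 check
        # ...and so on for the remaining nodes

With 240 OSDs behind them, more RGW processes mostly just means more concurrent requests in flight.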


>>> I am using MinIO Warp for benchmarking (PUT), with 1 Warp server and 5 Warp clients, benchmarking against the HAProxy.
>>> Results:
>>> - Using 10MB object size, I am hitting the 10Gbit/s link of the HAProxy server. That's good.
>>> - Using 500K object size, I am getting a throughput of 70 to 150 MB/s with 140 to 300 obj/s.
>> Tiny objects are the devil of any object storage deployment.  The HDDs are killing you here, especially for the index pool.  You might do a bit better by upping pg_num beyond the party line.
> 
> I would expect high write await times in that case, but all OSDs/disks show write await times of 1 ms up to 3 ms.

There are still serializations in the OSD and PG code.  You have 240 OSDs; does your index pool have *at least* 256 PGs?
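
For example (assuming the default zone pool names, so adjust for yours):

    ceph osd pool get default.rgw.buckets.index pg_num
    ceph osd pool set default.rgw.buckets.index pg_num 256

    # if the autoscaler keeps shrinking it again, pin it for this pool
    ceph osd pool set default.rgw.buckets.index pg_autoscale_mode off

The same check is worth doing on the data pool (default.rgw.buckets.data) given 240 OSDs.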


> 
>> You might also disable Nagle on the RGW nodes.
> 
> I need to look up what exactly that is and does.
> 
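
Short version: Nagle's algorithm batches small TCP writes before sending them, trading latency for fewer packets, which hurts when you are pushing lots of small S3 requests.  If I remember the frontend option correctly, you can ask the RGW frontend to set TCP_NODELAY on its sockets, e.g. (section name hypothetical, same caveat as above):

    [client.rgw.node01]
        rgw_frontends = "beast port=7480 tcp_nodelay=1"

Worth verifying against the 15.2 docs, but it is a cheap thing to try.
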
>>> It depends on the concurrency setting of Warp.
>>> It looks like the objects/s is the bottleneck, not the throughput.
>>> Max memory usage is about 80-90GB per node. CPUs are mostly idle.
>>> Is it reasonable to expect more IOPS / objects/s from RGW with my setup? At this moment I am not able to find the bottleneck that is causing the low obj/s.
>> HDDs are a false economy.
> 
> Got it :)
> 
>>> Ceph version is 15.2.
>>> Thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


