Re: Experience with 100G Ceph in Proxmox

So, looking at this, your IOPS are limited by latency.
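(As a rough check, assuming IOPS ≈ numjobs × iodepth / average latency: 4 × 1 / ~392 µs ≈ 10.2k, which is exactly the figure your run reports.)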

You have two options to increase your IOPS: reduce the latency, or increase the number of IOs in flight.

You are currently running an iodepth of 1. Ceph is designed to run many IOs in parallel, so you could either increase the number of jobs or increase the iodepth.

I would be increasing the iodepth to 16 or 32 as a starting point.

IOPS should then increase until the latency starts to jump up, which would mean you are hitting the limits of some other component.

The other thing is to look at how you can decrease latency: if, in theory, you could halve the latency by doing direct hardware access, you would double the IOPS.

But what I would say is that 4 concurrent IOs are never going to stress a large number of NVMe drives.
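For example, a re-run of your fio test with a deeper queue might look something like this (a sketch based on the command you posted further down, with only --iodepth added; 16 or 64 are just as reasonable starting values):

fio --name=registry-read --ioengine=libaio --rw=randread --bs=4k --numjobs=4 --iodepth=32 --size=1G --runtime=60 --group_reporting

Note that --iodepth only takes effect with an asynchronous ioengine such as libaio, which you are already using.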

Sent from my iPhone

> On 16 Mar 2025, at 21:42, Giovanna Ratini <giovanna.ratini@xxxxxxxxxxxxxxx> wrote:
> 
> Hello 😊,
> 
> Thank you very much for your response.
> 
> Let me give you some more information.
> I do not have any MS Windows VMs, only Debian and Ubuntu VMs.
> 
> I have a *Proxmox Cluster* with *6 hosts*. The network setup is as follows:
> 
> * *10G link* for Ceph Cluster
> * *10G link* for Ceph public
> * *1G link* for Corosync
> * *1G IPMI*
> * *10G link* for VMs
> 
> Each host has *2 or 3 OSDs (15TB NVMe)*. The hosts are *heterogeneous*, but all have *512GB RAM*.
> 
> I do not observe any bottlenecks in *htop or iftop*, and *iostat* reports only *0.12% iowait*. However, *fio* test results are concerning.
> 
> Here is the *fio* command I used:
> 
> fio --name=registry-read --ioengine=libaio --rw=randread --bs=4k --numjobs=4 --size=1G --runtime=60 --group_reporting
> registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
> ...
> fio-3.33
> Starting 4 processes
> registry-read: Laying out IO file (1 file / 1024MiB)
> registry-read: Laying out IO file (1 file / 1024MiB)
> registry-read: Laying out IO file (1 file / 1024MiB)
> registry-read: Laying out IO file (1 file / 1024MiB)
> Jobs: 4 (f=4): [r(4)][100.0%][r=39.8MiB/s][r=10.2k IOPS][eta 00m:00s]
> registry-read: (groupid=0, jobs=4): err= 0: pid=231332: Sun Mar 16 22:30:24 2025
>   read: IOPS=10.2k, BW=39.7MiB/s (41.7MB/s)(2385MiB/60001msec)
>     slat (usec): min=194, max=13111, avg=390.63, stdev=80.29
>     clat (nsec): min=910, max=190362, avg=1521.76, stdev=873.64
>      lat (usec): min=195, max=13114, avg=392.15, stdev=80.35
>     clat percentiles (nsec):
>      |  1.00th=[ 1112],  5.00th=[ 1208], 10.00th=[ 1224], 20.00th=[ 1272],
>      | 30.00th=[ 1288], 40.00th=[ 1320], 50.00th=[ 1352], 60.00th=[ 1400],
>      | 70.00th=[ 1496], 80.00th=[ 1704], 90.00th=[ 1960], 95.00th=[ 2224],
>      | 99.00th=[ 2832], 99.50th=[ 3856], 99.90th=[12096], 99.95th=[16768],
>      | 99.99th=[26240]
>    bw (  KiB/s): min=31984, max=43288, per=100.00%, avg=40730.22, stdev=381.52, samples=476
>    iops        : min= 7996, max=10822, avg=10182.55, stdev=95.38, samples=476
>   lat (nsec)   : 1000=0.02%
>   lat (usec)   : 2=91.02%, 4=8.48%, 10=0.32%, 20=0.12%, 50=0.03%
>   lat (usec)   : 100=0.01%, 250=0.01%
>   cpu          : usr=0.80%, sys=5.99%, ctx=610640, majf=0, minf=47
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=610483,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>    READ: bw=39.7MiB/s (41.7MB/s), 39.7MiB/s-39.7MiB/s (41.7MB/s-41.7MB/s), io=2385MiB (2501MB), run=60001-60001msec
> 
> Summary:
> 
> *Test Results:*
> 
> * *IOPS:* 10.2k
> * *Bandwidth:* 39.7MiB/s (41.7MB/s)
> * *Latency:*
>     o Avg: *392µs*
>     o 99.9th percentile (clat): *~12µs*
> * *CPU Usage:* usr=0.80%, sys=5.99%
> 
> Kind regards,
> Gio
> 
> 
> 
> 
> 
>> On 11.03.2025 at 11:55, Giovanna Ratini wrote:
>> Hello everyone,
>> 
>> We are running Ceph in Proxmox with a 10G network.
>> 
>> Unfortunately, we are experiencing very low read rates. I will try to implement the solution recommended in the Proxmox forum. However, even 80 MB per second with an NVMe drive is quite disappointing.
>> Forum link <https://forum.proxmox.com/threads/slow-performance-on-ceph-per-vm.151223/#post-685070>
>> 
>> For this reason, we are considering purchasing a 100G switch for our servers.
>> 
>> This raises some questions:
>> Should I still use separate networks for VMs and Ceph with 100G?
>> I have read that running Ceph on bridged connections is not recommended.
>> 
>> Does anyone have experience with 100G Ceph in Proxmox?
>> 
>> Is upgrading to 100G a good idea, or will I have 60G sitting idle?
>> 
>> Thanks in advance!
>> 
>> Gio
>> 
>> 
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



