Re: Experience with 100G Ceph in Proxmox

Hi Chris,

I tried KRBD, even with a newly created disk and after shutting down and starting the VM again, but there was no measurable difference.

Our Ceph is 18.2.4; that may be a factor to consider, but 9k -> 273k?!

Maybe Giovanna can test the KRBD option and report back... :)
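
If it helps, a quick way to double-check which path a disk actually takes is to look for kernel-mapped RBD images on the Proxmox host (just a sketch; librbd-backed disks won't show up here):

# rbd showmapped     # images mapped by the kernel RBD client (KRBD)
# ls -l /dev/rbd*    # the corresponding block devices, if any are mapped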

Cheers

El 20/3/25 a las 16:19, Chris Palmer escribió:
Hi Eneko

No containers. In the Proxmox console go to Datacenter\Storage, click on the storage you are using, then Edit. There is a tick box labelled KRBD. With that set, any virtual disks created in that storage will use KRBD rather than librbd, so it applies to all VMs that use that storage.
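
The same flag can also be set from the CLI; a rough sketch, assuming the storage ID is "cephvm" as in the config further down the thread (adjust to yours), and note that running VMs need a stop/start afterwards so their disks get re-mapped:

# pvesm set cephvm --krbd 1    # use KRBD for RBD disks on this storage
# pvesm set cephvm --krbd 0    # switch back to librbd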

Chris

On 20/03/2025 15:00, Eneko Lacunza wrote:

Chris, did you test from a container? Or how do you configure a KRBD disk for a VM?

El 20/3/25 a las 15:15, Chris Palmer escribió:
I just ran that command on one of my VMs. Salient details:

  * Ceph cluster 19.2.1 with 3 nodes, 4 x SATA disks with shared NVMe
    DB/WAL, single 10g NICs
  * Proxmox 8.3.5 cluster with 2 nodes (separate nodes from Ceph), single
    10g NICs, single 1g NICs for corosync
  * Test VM was using KRBD R3 pool on HDD, iothread=1, aio=io_uring,
    cache=writeback

The results are very different:

# fio --name=registry-read --ioengine=libaio --rw=randread --bs=4k --numjobs=4 --size=1G --runtime=60 --group_reporting --iodepth=16
registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.37
Starting 4 processes
Jobs: 4 (f=4): [r(4)][-.-%][r=1080MiB/s][r=277k IOPS][eta 00m:00s]
registry-read: (groupid=0, jobs=4): err= 0: pid=13355: Thu Mar 20 13:57:05 2025
  read: IOPS=273k, BW=1068MiB/s (1120MB/s)(4096MiB/3835msec)
    slat (usec): min=7, max=3802, avg=13.77, stdev= 6.41
    clat (nsec): min=599, max=4395.1k, avg=215298.68, stdev=38131.71
     lat (usec): min=11, max=4408, avg=229.07, stdev=40.01
    clat percentiles (usec):
     |  1.00th=[  194],  5.00th=[  200], 10.00th=[  202], 20.00th=[  204],
     | 30.00th=[  206], 40.00th=[  208], 50.00th=[  210], 60.00th=[  212],
     | 70.00th=[  215], 80.00th=[  217], 90.00th=[  227], 95.00th=[  243],
     | 99.00th=[  367], 99.50th=[  420], 99.90th=[  594], 99.95th=[  668],
     | 99.99th=[  963]
   bw (  MiB/s): min=  920, max= 1118, per=100.00%, avg=1068.04, stdev=16.81, samples=28
   iops        : min=235566, max=286286, avg=273417.14, stdev=4303.79, samples=28
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 20=0.01%, 50=0.01%, 100=0.01%, 250=96.06%, 500=3.67%
  lat (usec)   : 750=0.24%, 1000=0.02%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=4.68%, sys=29.99%, ctx=1048987, majf=0, minf=102
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1048576,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=1068MiB/s (1120MB/s), 1068MiB/s-1068MiB/s (1120MB/s-1120MB/s), io=4096MiB (4295MB), run=3835-3835msec

Disk stats (read/write):
  sdc: ios=999346/0, sectors=7994768/0, merge=0/0, ticks=10360/0, in_queue=10361, util=95.49%



On 20/03/2025 12:23, Eneko Lacunza wrote:
Hi Giovanna,

I just tested one of my VMs:
# fio --name=registry-read --ioengine=libaio --rw=randread --bs=4k --numjobs=4 --size=1G --runtime=60 --group_reporting
registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.33
Starting 4 processes
registry-read: Laying out IO file (1 file / 1024MiB)
registry-read: Laying out IO file (1 file / 1024MiB)
registry-read: Laying out IO file (1 file / 1024MiB)
registry-read: Laying out IO file (1 file / 1024MiB)
Jobs: 4 (f=0): [f(4)][100.0%][r=33.5MiB/s][r=8578 IOPS][eta 00m:00s]
registry-read: (groupid=0, jobs=4): err= 0: pid=24261: Thu Mar 20 12:57:26 2025
  read: IOPS=8538, BW=33.4MiB/s (35.0MB/s)(2001MiB/60001msec)
    slat (usec): min=309, max=4928, avg=464.54, stdev=73.15
    clat (nsec): min=602, max=1532.4k, avg=1999.15, stdev=3724.16
     lat (usec): min=310, max=4931, avg=466.54, stdev=73.36
    clat percentiles (nsec):
     |  1.00th=[  812],  5.00th=[  884], 10.00th=[  940], 20.00th=[ 1096],
     | 30.00th=[ 1368], 40.00th=[ 1576], 50.00th=[ 1720], 60.00th=[ 1832],
     | 70.00th=[ 1944], 80.00th=[ 2096], 90.00th=[ 2480], 95.00th=[ 3024],
     | 99.00th=[12480], 99.50th=[15808], 99.90th=[47360], 99.95th=[61696],
     | 99.99th=[90624]
   bw (  KiB/s): min=30448, max=35868, per=100.00%, avg=34155.76, stdev=269.75, samples=476
   iops        : min= 7612, max= 8966, avg=8538.87, stdev=67.43, samples=476
  lat (nsec)   : 750=0.06%, 1000=14.94%
  lat (usec)   : 2=59.18%, 4=23.07%, 10=1.28%, 20=1.17%, 50=0.21%
  lat (usec)   : 100=0.08%, 250=0.01%, 500=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=1.04%, sys=5.50%, ctx=537639, majf=0, minf=36
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=512316,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=33.4MiB/s (35.0MB/s), 33.4MiB/s-33.4MiB/s (35.0MB/s-35.0MB/s), io=2001MiB (2098MB), run=60001-60001msec

Results are worse than yours, but this is on a production (not very busy) pool with 4x3.84TB SATA disks (4 disks total vs ~15 disks in your case) and 10G network.

The VM CPU is x86-64-v3 and the host CPU is a Ryzen 1700.

I get almost the same IOPS with --iodepth=16.

I tried moving the VM to a Ryzen 5900X and results are somewhat better:

# fio --name=registry-read --ioengine=libaio --rw=randread --bs=4k --numjobs=4 --size=1G --runtime=60 --group_reporting --iodepth=16
registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.33
Starting 4 processes
Jobs: 4 (f=4): [r(4)][100.0%][r=45.4MiB/s][r=11.6k IOPS][eta 00m:00s]
registry-read: (groupid=0, jobs=4): err= 0: pid=24282: Thu Mar 20 13:18:23 2025
  read: IOPS=11.6k, BW=45.5MiB/s (47.7MB/s)(2730MiB/60001msec)
    slat (usec): min=110, max=21206, avg=341.21, stdev=79.69
    clat (nsec): min=1390, max=42395k, avg=5147009.08, stdev=475506.40
     lat (usec): min=335, max=42779, avg=5488.22, stdev=498.03
    clat percentiles (usec):
     |  1.00th=[ 4621],  5.00th=[ 4752], 10.00th=[ 4817], 20.00th=[ 4948],
     | 30.00th=[ 5014], 40.00th=[ 5080], 50.00th=[ 5080], 60.00th=[ 5145],
     | 70.00th=[ 5211], 80.00th=[ 5276], 90.00th=[ 5407], 95.00th=[ 5538],
     | 99.00th=[ 6194], 99.50th=[ 6783], 99.90th=[ 9765], 99.95th=[12125],
     | 99.99th=[24249]
   bw (  KiB/s): min=36434, max=48352, per=100.00%, avg=46612.18, stdev=300.09, samples=476
   iops        : min= 9108, max=12088, avg=11653.04, stdev=75.03, samples=476
  lat (usec)   : 2=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=99.90%, 20=0.08%, 50=0.01%
  cpu          : usr=0.98%, sys=4.18%, ctx=706399, majf=0, minf=99
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=698956,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=45.5MiB/s (47.7MB/s), 45.5MiB/s-45.5MiB/s (47.7MB/s-47.7MB/s), io=2730MiB (2863MB), run=60001-60001msec

I think we're limited by the IO thread. I suggest you try multiple disks with VirtIO SCSI single.
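
Something along these lines would add the extra disks, each getting its own IO thread under virtio-scsi-single (only a sketch: VMID 6506 and the "cephvm" storage are taken from the config below, and the sizes are placeholders):

# qm set 6506 --scsihw virtio-scsi-single
# qm set 6506 --scsi2 cephvm:100,iothread=1,ssd=1    # allocate a new 100G disk on the Ceph storage
# qm set 6506 --scsi3 cephvm:100,iothread=1,ssd=1    # a second one, to spread the load

Inside the guest the new disks could then be striped (md/LVM) or tested in parallel to see whether aggregate IOPS scale.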

My VM conf:
agent: 1
boot: order=scsi0;ide2;net0
cores: 2
cpu: x86-64-v3
ide2: none,media=cdrom
memory: 2048
meta: creation-qemu=9.0.2,ctime=1739888364
name: elacunza-btrfs-test
net0: virtio=BC:24:11:47:9B:58,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: proxmox_r3_ssd2:vm-112-disk-0,discard=on,iothread=1,size=15G
scsihw: virtio-scsi-single
smbios1: uuid=263ab229-4379-4abf-b6bf-615b98ccd3d4
sockets: 1
vmgenid: 13b7f2a4-2a42-4600-845a-da88f96ae6e8

I think this is a KVM/QEMU issue, not a Ceph issue :) Maybe you can get better suggestions on the pve-user mailing list.

Cheers

El 20/3/25 a las 12:29, Giovanna Ratini escribió:
Hello Eneko,

this is my configuration. The performance is similar across all VMs. I am now checking GitLab, as that is where people are complaining the most.

agent: 1
balloon: 65000
bios: ovmf
boot: order=scsi0;net0
cores: 10
cpu: host
efidisk0: cephvm:vm-6506-disk-0,efitype=4m,size=528K
memory: 130000
meta: creation-qemu=9.0.2,ctime=1734995123
name: gitlab02
net0: virtio=BC:24:11:6E:28:71,bridge=vmbr1,firewall=1
numa: 0
ostype: l26
scsi0: cephvm:vm-6506-disk-1,aio=native,cache=writeback,iothread=1,size=64G,ssd=1
scsi1: cephvm:vm-6506-disk-2,aio=native,cache=writeback,iothread=1,size=10T,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=0a5294c0-c82a-40f2-aae4-f5880022a2ac
sockets: 2
vmgenid: ea610fde-6c71-4b7f-9257-fa431a428e16

Cheers,

Gio

Am 20.03.2025 um 10:23 schrieb Eneko Lacunza:
Hi Giovanna,

Can you post VM's full config?

Also, can you test with IO thread enabled, VirtIO SCSI single, and multiple disks?
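
(The full config can be dumped on the host with qm, e.g. "qm config 6506", 6506 being the VMID visible in the disk names above.)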

Cheers

El 19/3/25 a las 17:27, Giovanna Ratini escribió:

hello Eneko,

Yes I did.  No significant changes.  :-(
Cheers,

Gio


Am Mittwoch, März 19, 2025 13:09 CET, schrieb Eneko Lacunza <elacunza@xxxxxxxxx>:

Hi Giovanna,

Have you tried increasing the iothreads option for the VM?

Cheers

El 18/3/25 a las 19:13, Giovanna Ratini escribió:
> Hello Anthony,
>
> No, no QoS is applied to the VMs.
>
> The server has PCIe Gen 4.
>
> ceph osd dump | grep pool
> pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash
> rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 21 flags
> hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
> read_balance_score 13.04
> pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
> last_change 598 lfor 0/598/596 flags hashpspool stripe_width 0
> application cephfs read_balance_score 2.02
> pool 3 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
> last_change 50 flags hashpspool stripe_width 0 pg_autoscale_bias 4
> pg_num_min 16 recovery_priority 5 application cephfs
> read_balance_score 2.42
> pool 4 'cephvm' replicated size 3 min_size 2 crush_rule 0 object_hash
> rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 16386
> lfor 0/644/2603 flags hashpspool,selfmanaged_snaps stripe_width 0
> application rbd read_balance_score 1.52
>
> I think this is the default config. 🙈
>
> I will look for a firmware update for my Supermicro chassis.
>
> Thank you
>
>
> Am 18.03.2025 um 17:57 schrieb Anthony D'Atri:
>>> Then I tested on the *Proxmox host*, and the results were
>>> significantly better.
>> My Proxmox prowess is limited, but from my experience with other
>> virtualization platforms, I have to ask if there is any QoS
>> throttling applied to VMs.  With OpenStack or DO there is often IOPS
>> and/or throughput throttling via libvirt to mitigate noisy neighbors.
>>
>>>   fio --name=host-test --filename=/dev/rbd0 --ioengine=libaio
>>> --rw=randread --bs=4k --numjobs=4 --iodepth=32 --size=1G
>>> --runtime=60 --group_reporting
>>>
>>> *IOPS*: *1.54M*
>>>
>>> # *Bandwidth*: *6032MiB/s (6325MB/s)*
>>> # *Latency*:
>>>
>>> * *Avg*: *39.8µs*
>>> * *99.9th percentile*: *71µs*
>>>
>>> # *CPU Usage*: *usr=22.60%, sys=77.13%*
>>> #
>>>
>>> Am 18.03.2025 um 15:27 schrieb Anthony D'Atri:
>>>> Which NVMe drive SKUs specifically?
>>> # */dev/nvme6n1* – *KCD61LUL15T3* – 15.36 TB – SN: 6250A02QT5A8
>>> # */dev/nvme5n1* – *KCD61LUL15T3* – 15.36 TB – SN: 42R0A036T5A8
>>> # */dev/nvme4n1* – *KCD61LUL15T3* – 15.36 TB – SN: 6250A02UT5A8
>> Kioxia CD6.  If you were using client-class drives all manner of
>> performance issues would be expected.
>>
>> Is your server chassis at least PCIe Gen 4? If it’s Gen 3 that may
>> hamper these drives.
>>
>> Also, how many of these are in your cluster? If it’s a small number
>> you might still benefit from chopping each into at least 2 separate
>> OSDs.
>>
>> And please send `ceph osd dump | grep pool`, having too few PGs
>> wouldn’t do you any favors.
>>
>>
>>>> Are you running a recent kernel?
>>> penultimate: 6.8.12-8-pve (VM, yes)
>> Groovy.  If you were running like a CentOS 6 or CentOS 7 kernel then
>> NVMe issues might be expected as old kernels had rudimentary NVMe
>> support.
>>
>>>>   Have you updated firmware on the NVMe devices?
>>> No.
>> Kioxia appears to not release firmware updates publicly but your
>> chassis brand (Dell, HP, SMCI, etc) might have an update.
>> e.g. https://www.dell.com/support/home/en-vc/drivers/driversdetails?driverid=7ny55
>>
>>
>>   If there is an available update I would strongly suggest applying.
>
>>
>>> Thanks again,
>>>
>>> best regards,
>>> Gio
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx





Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



