Re: Experience with 100G Ceph in Proxmox

Eneko Lacunza <elacunza@xxxxxxxxx> · Thu, 20 Mar 2025 13:23:04 +0100

Hi Giovanna,

I just tested one of my VMs:
# fio --name=registry-read --ioengine=libaio --rw=randread --bs=4k 
--numjobs=4 --size=1G --runtime=60 --group_reporting
registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, 
(T) 4096B-4096B, ioengine=libaio, iodepth=1
registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, 
(T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.33
Starting 4 processes
registry-read: Laying out IO file (1 file / 1024MiB)
registry-read: Laying out IO file (1 file / 1024MiB)
registry-read: Laying out IO file (1 file / 1024MiB)
registry-read: Laying out IO file (1 file / 1024MiB)
Jobs: 4 (f=0): [f(4)][100.0%][r=33.5MiB/s][r=8578 IOPS][eta 00m:00s]
registry-read: (groupid=0, jobs=4): err= 0: pid=24261: Thu Mar 20 
12:57:26 2025
  read: IOPS=8538, BW=33.4MiB/s (35.0MB/s)(2001MiB/60001msec)
    slat (usec): min=309, max=4928, avg=464.54, stdev=73.15
    clat (nsec): min=602, max=1532.4k, avg=1999.15, stdev=3724.16
     lat (usec): min=310, max=4931, avg=466.54, stdev=73.36
    clat percentiles (nsec):
     |  1.00th=[  812],  5.00th=[  884], 10.00th=[  940], 20.00th=[ 1096],
     | 30.00th=[ 1368], 40.00th=[ 1576], 50.00th=[ 1720], 60.00th=[ 1832],
     | 70.00th=[ 1944], 80.00th=[ 2096], 90.00th=[ 2480], 95.00th=[ 3024],
     | 99.00th=[12480], 99.50th=[15808], 99.90th=[47360], 99.95th=[61696],
     | 99.99th=[90624]
   bw (  KiB/s): min=30448, max=35868, per=100.00%, avg=34155.76, 
stdev=269.75, samples=476
   iops        : min= 7612, max= 8966, avg=8538.87, stdev=67.43, 
samples=476
  lat (nsec)   : 750=0.06%, 1000=14.94%
  lat (usec)   : 2=59.18%, 4=23.07%, 10=1.28%, 20=1.17%, 50=0.21%
  lat (usec)   : 100=0.08%, 250=0.01%, 500=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=1.04%, sys=5.50%, ctx=537639, majf=0, minf=36
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     issued rwts: total=512316,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=33.4MiB/s (35.0MB/s), 33.4MiB/s-33.4MiB/s 
(35.0MB/s-35.0MB/s), io=2001MiB (2098MB), run=60001-60001msec
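
As a quick sanity check on the summary above, here is a minimal sketch that recomputes the bandwidth from the IOPS figure and shows how much of the per-I/O latency is submission latency (slat). The numbers are copied from the fio output; the function names are hypothetical and only for illustration.

```python
# Values copied from the fio summary above (numjobs=4, bs=4k, iodepth=1).
IOPS = 8538
BLOCK_SIZE = 4096        # bytes, from --bs=4k
SLAT_AVG_US = 464.54     # average submission latency reported by fio
LAT_AVG_US = 466.54      # average total latency reported by fio


def bandwidth_mib_s(iops: float, block_size: int) -> float:
    """Throughput in MiB/s for a given IOPS figure and block size in bytes."""
    return iops * block_size / 2**20


def slat_share(slat_us: float, lat_us: float) -> float:
    """Fraction of the total latency that is spent in submission."""
    return slat_us / lat_us


print(f"bandwidth: {bandwidth_mib_s(IOPS, BLOCK_SIZE):.1f} MiB/s")
print(f"slat share of total latency: {slat_share(SLAT_AVG_US, LAT_AVG_US):.1%}")
```

The recomputed bandwidth agrees with the 33.4MiB/s that fio reports, and almost all of the measured latency sits in slat rather than clat.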

Results are worse than yours, but this is on a production (not very 
busy) pool with 4x3.84TB SATA disks (4 disks total vs ~15 disks in your 
case) and 10G network.

The VM CPU type is x86-64-v3 and the host CPU is a Ryzen 1700.

I get almost the same IOPS with --iodepth=16.

I tried moving the VM to a Ryzen 5900X and results are somewhat better:

# fio --name=registry-read --ioengine=libaio --rw=randread --bs=4k 
--numjobs=4 --size=1G --runtime=60 --group_reporting --iodepth=16
registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, 
(T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.33
Starting 4 processes
Jobs: 4 (f=4): [r(4)][100.0%][r=45.4MiB/s][r=11.6k IOPS][eta 00m:00s]
registry-read: (groupid=0, jobs=4): err= 0: pid=24282: Thu Mar 20 
13:18:23 2025
  read: IOPS=11.6k, BW=45.5MiB/s (47.7MB/s)(2730MiB/60001msec)
    slat (usec): min=110, max=21206, avg=341.21, stdev=79.69
    clat (nsec): min=1390, max=42395k, avg=5147009.08, stdev=475506.40
     lat (usec): min=335, max=42779, avg=5488.22, stdev=498.03
    clat percentiles (usec):
     |  1.00th=[ 4621],  5.00th=[ 4752], 10.00th=[ 4817], 20.00th=[ 4948],
     | 30.00th=[ 5014], 40.00th=[ 5080], 50.00th=[ 5080], 60.00th=[ 5145],
     | 70.00th=[ 5211], 80.00th=[ 5276], 90.00th=[ 5407], 95.00th=[ 5538],
     | 99.00th=[ 6194], 99.50th=[ 6783], 99.90th=[ 9765], 99.95th=[12125],
     | 99.99th=[24249]
   bw (  KiB/s): min=36434, max=48352, per=100.00%, avg=46612.18, 
stdev=300.09, samples=476
   iops        : min= 9108, max=12088, avg=11653.04, stdev=75.03, 
samples=476
  lat (usec)   : 2=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=99.90%, 20=0.08%, 50=0.01%
  cpu          : usr=0.98%, sys=4.18%, ctx=706399, majf=0, minf=99
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, 
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     issued rwts: total=698956,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=45.5MiB/s (47.7MB/s), 45.5MiB/s-45.5MiB/s 
(47.7MB/s-47.7MB/s), io=2730MiB (2863MB), run=60001-60001msec
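
The two runs can be compared with Little's law (throughput = outstanding requests / mean latency). The sketch below assumes every job keeps its full queue depth outstanding, which is an idealization. The latency values are the fio "lat" averages from the two runs above, and the function name is hypothetical.

```python
def expected_iops(numjobs: int, iodepth: int, avg_lat_us: float) -> float:
    """Little's law: IOPS = outstanding requests / mean latency in seconds."""
    outstanding = numjobs * iodepth
    return outstanding / (avg_lat_us * 1e-6)


# Run 1: Ryzen 1700 host, iodepth=1, avg lat 466.54 usec (fio reported 8538.87 IOPS).
run1 = expected_iops(numjobs=4, iodepth=1, avg_lat_us=466.54)
# Run 2: Ryzen 5900X host, iodepth=16, avg lat 5488.22 usec (fio reported 11653.04 IOPS).
run2 = expected_iops(numjobs=4, iodepth=16, avg_lat_us=5488.22)

print(f"run 1 predicted IOPS: {run1:.0f}")
print(f"run 2 predicted IOPS: {run2:.0f}")
print(f"queue depth grew {16 // 1}x, predicted IOPS grew {run2 / run1:.2f}x")
```

Both predictions land within 1% of what fio measured. The queue depth is 16 times higher in the second run, but average latency rose by almost the same factor, so throughput barely moves. That pattern is what one would expect if requests are being served one at a time somewhere in the path.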

I think we're limited by the IO thread. I suggest you try multiple disks 
with VirtIO SCSI single.

My VM conf:
agent: 1
boot: order=scsi0;ide2;net0
cores: 2
cpu: x86-64-v3
ide2: none,media=cdrom
memory: 2048
meta: creation-qemu=9.0.2,ctime=1739888364
name: elacunza-btrfs-test
net0: virtio=BC:24:11:47:9B:58,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: proxmox_r3_ssd2:vm-112-disk-0,discard=on,iothread=1,size=15G
scsihw: virtio-scsi-single
smbios1: uuid=263ab229-4379-4abf-b6bf-615b98ccd3d4
sockets: 1
vmgenid: 13b7f2a4-2a42-4600-845a-da88f96ae6e8
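
To check several VMs for the settings discussed here (controller type and per-disk iothread), a small parser for this config format can help. This is a minimal sketch that assumes the "key: value" layout and the comma-separated disk options shown above. The function names and the set of disk key prefixes are assumptions made for illustration, not part of any Proxmox API.

```python
import re

# Keys that look like disk entries, e.g. scsi0 or virtio1 (assumed prefixes).
DISK_KEY = re.compile(r"^(scsi|virtio|sata|ide)\d+$")


def parse_vm_conf(text: str) -> dict[str, str]:
    """Parse 'key: value' lines; only the first colon separates key and value."""
    conf = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def disk_options(value: str) -> dict[str, str]:
    """Split 'storage:volume,opt=val,...' into the volume plus its options."""
    volume, *opts = value.split(",")
    parsed = {"volume": volume}
    for opt in opts:
        name, _, val = opt.partition("=")
        parsed[name] = val
    return parsed


def iothread_report(text: str) -> dict:
    """Return the controller type and, per disk, whether iothread=1 is set."""
    conf = parse_vm_conf(text)
    disks = {
        key: disk_options(value)
        for key, value in conf.items()
        if DISK_KEY.match(key) and "media=cdrom" not in value
    }
    return {
        "scsihw": conf.get("scsihw"),
        "disks": {key: opts.get("iothread") == "1" for key, opts in disks.items()},
    }


SAMPLE = """\
ide2: none,media=cdrom
scsi0: proxmox_r3_ssd2:vm-112-disk-0,discard=on,iothread=1,size=15G
scsihw: virtio-scsi-single
"""

print(iothread_report(SAMPLE))
```

The sample is an excerpt of the config above. The report lists one entry per disk, so a VM with several disks shows at a glance which of them have an IO thread enabled.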

I think this is a KVM/QEMU issue, not a Ceph issue :) Maybe you can get 
better suggestions on the pve-user mailing list.

Cheers

On 20/3/25 at 12:29, Giovanna Ratini wrote:
Hello Eneko,

This is my configuration. The performance is similar across all VMs. I 
am now checking GitLab, as that is where people are complaining the most.

agent: 1
balloon: 65000
bios: ovmf
boot: order=scsi0;net0
cores: 10
cpu: host
efidisk0: cephvm:vm-6506-disk-0,efitype=4m,size=528K
memory: 130000
meta: creation-qemu=9.0.2,ctime=1734995123
name: gitlab02
net0: virtio=BC:24:11:6E:28:71,bridge=vmbr1,firewall=1
numa: 0
ostype: l26
scsi0: cephvm:vm-6506-disk-1,aio=native,cache=writeback,iothread=1,size=64G,ssd=1
scsi1: cephvm:vm-6506-disk-2,aio=native,cache=writeback,iothread=1,size=10T,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=0a5294c0-c82a-40f2-aae4-f5880022a2ac
sockets: 2
vmgenid: ea610fde-6c71-4b7f-9257-fa431a428e16

Cheers,

Gio

On 20.03.2025 at 10:23, Eneko Lacunza wrote:
Hi Giovanna,

Can you post the VM's full config?

Also, can you test with IO thread enabled, VirtIO SCSI single, and 
multiple disks?

Cheers

On 19/3/25 at 17:27, Giovanna Ratini wrote:

Hello Eneko,

Yes, I did. No significant changes. :-(
Cheers,

Gio

On Wednesday, March 19, 2025 13:09 CET, Eneko Lacunza 
<elacunza@xxxxxxxxx> wrote:

Hi Giovanna,

Have you tried increasing the iothreads option for the VM?

Cheers

On 18/3/25 at 19:13, Giovanna Ratini wrote:
> Hello Antony,
>
> No, no QoS is applied to the VMs.
>
> The server has PCIe Gen 4.
>
> ceph osd dump | grep pool
> pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash
> rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 21 flags
> hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
> read_balance_score 13.04
> pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
> last_change 598 lfor 0/598/596 flags hashpspool stripe_width 0
> application cephfs read_balance_score 2.02
> pool 3 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
> last_change 50 flags hashpspool stripe_width 0 pg_autoscale_bias 4
> pg_num_min 16 recovery_priority 5 application cephfs
> read_balance_score 2.42
> pool 4 'cephvm' replicated size 3 min_size 2 crush_rule 0 object_hash
> rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 16386
> lfor 0/644/2603 flags hashpspool,selfmanaged_snaps stripe_width 0
> application rbd read_balance_score 1.52
>
> I think this is the default config. 🙈
>
> I will look for a Supermicro firmware update for my chassis.
>
> Thank you
>
>
> On 18.03.2025 at 17:57, Anthony D'Atri wrote:
>>> Then I tested on the *Proxmox host*, and the results were
>>> significantly better.
>> My Proxmox prowess is limited, but from my experience with other
>> virtualization platforms, I have to ask if there is any QoS
>> throttling applied to VMs.  With OpenStack or DO there is often IOPS
>> and/or throughput throttling via libvirt to mitigate noisy neighbors.
>>
>>>   fio --name=host-test --filename=/dev/rbd0 --ioengine=libaio
>>> --rw=randread --bs=4k --numjobs=4 --iodepth=32 --size=1G
>>> --runtime=60 --group_reporting
>>>
>>> IOPS: 1.54M
>>> Bandwidth: 6032MiB/s (6325MB/s)
>>> Latency:
>>>   Avg: 39.8µs
>>>   99.9th percentile: 71µs
>>> CPU Usage: usr=22.60%, sys=77.13%
>>>
>>> On 18.03.2025 at 15:27, Anthony D'Atri wrote:
>>>> Which NVMe drive SKUs specifically?
>>> /dev/nvme6n1 – KCD61LUL15T3 – 15.36 TB – SN: 6250A02QT5A8
>>> /dev/nvme5n1 – KCD61LUL15T3 – 15.36 TB – SN: 42R0A036T5A8
>>> /dev/nvme4n1 – KCD61LUL15T3 – 15.36 TB – SN: 6250A02UT5A8
>> Kioxia CD6.  If you were using client-class drives all manner of
>> performance issues would be expected.
>>
>> Is your server chassis at least PCIe Gen 4?  If it’s Gen 3 that may
>> hamper these drives.
>>
>> Also, how many of these are in your cluster?  If it’s a small number
>> you might still benefit from chopping each into at least 2 separate
>> OSDs.
>>
>> And please send `ceph osd dump | grep pool`, having too few PGs
>> wouldn’t do you any favors.
>>
>>
>>>> Are you running a recent kernel?
>>> penultimate: 6.8.12-8-pve (VM, yes)
>> Groovy.  If you were running like a CentOS 6 or CentOS 7 kernel then
>> NVMe issues might be expected as old kernels had rudimentary NVMe
>> support.
>>
>>>>   Have you updated firmware on the NVMe devices?
>>> No.
>> Kioxia appears to not release firmware updates publicly but your
>> chassis brand (Dell, HP, SMCI, etc) might have an update, e.g.
>> https://www.dell.com/support/home/en-vc/drivers/driversdetails?driverid=7ny55
>>
>> If there is an available update I would strongly suggest applying it.
>
>>
>>> Thanks again,
>>>
>>> best regards,
>>> Gio
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 <tel:+34 943 569 206> | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
