Re: Experience with 100G Ceph in Proxmox

Hello Alvaro,

They are /not/ behind a hardware RAID controller. There is no RAID
controller at all: these are native NVMe SSDs, each attached directly to
its own PCIe lanes.
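For anyone who wants to double-check how the kernel sees such drives and
what queue depth they expose, something like this should do it (nvme0n1 is
only an example device name):

# lsblk -d -o NAME,TRAN,MODEL,SIZE
# cat /sys/block/nvme0n1/queue/nr_requests

If the transport column shows "nvme" rather than a SCSI transport, no RAID
card is sitting in the path.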

Cheers,

Gio

On 21.03.2025 at 09:27, Alvaro Soto wrote:
Digging in a different direction: I have a question. Are the drives
connected to a RAID array card, and how are they presented?

I don't recall where I read it, but it was something about a RAID card
presenting drives to the kernel as SCSI instead of NVMe, with the queue
depth being the issue.

Cheers.
--


Alvaro Soto


Note: My work hours may not be your work hours. Please do not feel the need
to respond during a time that is not convenient for you.
----------------------------------------------------------
Great people talk about ideas,
ordinary people talk about things,
small people talk... about other people



On Thu, Mar 20, 2025, 2:13 PM Giovanna Ratini <giovanna.ratini@xxxxxxxxxxxxxxx> wrote:

Hello,

Yes, I will test KRBD. I will be on holiday next week, so I don’t want
to make any changes before then.

Could you wait until 29.3?

This is a production environment, and restoring a backup would take
time. Or do you think the change carries no risk and can be made without concern?

Thank you,

Best regards,
Gio


On 20.03.2025 at 16:57, Eneko Lacunza wrote:
Hi Chris,

I tried KRBD, even with a newly created disk and after shutting down
and starting the VM again, but there was no measurable difference.

Our Ceph is 18.2.4; that may be a factor to consider, but 9k -> 273k?!

Maybe Giovanna can test the KRBD option and report back... :)

Cheers

On 20/3/25 at 16:19, Chris Palmer wrote:
Hi Eneko

No containers. In the Proxmox console go to Datacenter\Storage, click
on the storage you are using, then Edit. There is a KRBD tick box.
With that set, any virtual disks created in that storage will use
KRBD rather than librbd, so it applies to all VMs that use that storage.
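If you prefer the command line, I believe the same flag can be set on the
storage definition as well, something like this (the storage name is just
an example):

# pvesm set cephvm --krbd 1

As I understand it, existing disks pick the new setting up once the VM is
fully stopped and started again.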

Chris

On 20/03/2025 15:00, Eneko Lacunza wrote:
Chris, did you test from a container? Or how do you configure a KRBD
disk for a VM?

On 20/3/25 at 15:15, Chris Palmer wrote:
I just ran that command on one of my VMs. Salient details:

   * Ceph cluster 19.2.1 with 3 nodes, 4 x SATA disks with shared NVMe
     DB/WAL, single 10g NICs
   * Proxmox 8.3.5 cluster with 2 nodes (separate nodes from Ceph), single
     10g NICs, single 1g NICs for corosync
   * Test VM was using KRBD R3 pool on HDD, iothread=1, aio=io_uring,
     cache=writeback

The results are very different:

# fio --name=registry-read --ioengine=libaio --rw=randread --bs=4k
--numjobs=4 --size=1G --runtime=60 --group_reporting --iodepth=16
registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W)
4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.37
Starting 4 processes
Jobs: 4 (f=4): [r(4)][-.-%][r=1080MiB/s][r=277k IOPS][eta 00m:00s]
registry-read: (groupid=0, jobs=4): err= 0: pid=13355: Thu Mar 20
13:57:05 2025
   read: IOPS=273k, BW=1068MiB/s (1120MB/s)(4096MiB/3835msec)
     slat (usec): min=7, max=3802, avg=13.77, stdev= 6.41
     clat (nsec): min=599, max=4395.1k, avg=215298.68, stdev=38131.71
      lat (usec): min=11, max=4408, avg=229.07, stdev=40.01
     clat percentiles (usec):
      |  1.00th=[  194],  5.00th=[  200], 10.00th=[  202],
20.00th=[  204],
      | 30.00th=[  206], 40.00th=[  208], 50.00th=[  210],
60.00th=[  212],
      | 70.00th=[  215], 80.00th=[  217], 90.00th=[  227],
95.00th=[  243],
      | 99.00th=[  367], 99.50th=[  420], 99.90th=[  594],
99.95th=[  668],
      | 99.99th=[  963]
    bw (  MiB/s): min=  920, max= 1118, per=100.00%, avg=1068.04,
stdev=16.81, samples=28
    iops        : min=235566, max=286286, avg=273417.14,
stdev=4303.79, samples=28
   lat (nsec)   : 750=0.01%, 1000=0.01%
   lat (usec)   : 20=0.01%, 50=0.01%, 100=0.01%, 250=96.06%, 500=3.67%
   lat (usec)   : 750=0.24%, 1000=0.02%
   lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
   cpu          : usr=4.68%, sys=29.99%, ctx=1048987, majf=0, minf=102
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%,
32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%,
64=0.0%, >=64=0.0%
      issued rwts: total=1048576,0,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
    READ: bw=1068MiB/s (1120MB/s), 1068MiB/s-1068MiB/s
(1120MB/s-1120MB/s), io=4096MiB (4295MB), run=3835-3835msec

Disk stats (read/write):
   sdc: ios=999346/0, sectors=7994768/0, merge=0/0, ticks=10360/0,
in_queue=10361, util=95.49%



On 20/03/2025 12:23, Eneko Lacunza wrote:
Hi Giovanna,

I just tested one of my VMs:
# fio --name=registry-read --ioengine=libaio --rw=randread --bs=4k
--numjobs=4 --size=1G --runtime=60 --group_reporting
registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W)
4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W)
4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.33
Starting 4 processes
registry-read: Laying out IO file (1 file / 1024MiB)
registry-read: Laying out IO file (1 file / 1024MiB)
registry-read: Laying out IO file (1 file / 1024MiB)
registry-read: Laying out IO file (1 file / 1024MiB)
Jobs: 4 (f=0): [f(4)][100.0%][r=33.5MiB/s][r=8578 IOPS][eta 00m:00s]
registry-read: (groupid=0, jobs=4): err= 0: pid=24261: Thu Mar 20
12:57:26 2025
   read: IOPS=8538, BW=33.4MiB/s (35.0MB/s)(2001MiB/60001msec)
     slat (usec): min=309, max=4928, avg=464.54, stdev=73.15
     clat (nsec): min=602, max=1532.4k, avg=1999.15, stdev=3724.16
      lat (usec): min=310, max=4931, avg=466.54, stdev=73.36
     clat percentiles (nsec):
      |  1.00th=[  812],  5.00th=[  884], 10.00th=[  940],
20.00th=[ 1096],
      | 30.00th=[ 1368], 40.00th=[ 1576], 50.00th=[ 1720],
60.00th=[ 1832],
      | 70.00th=[ 1944], 80.00th=[ 2096], 90.00th=[ 2480],
95.00th=[ 3024],
      | 99.00th=[12480], 99.50th=[15808], 99.90th=[47360],
99.95th=[61696],
      | 99.99th=[90624]
    bw (  KiB/s): min=30448, max=35868, per=100.00%, avg=34155.76,
stdev=269.75, samples=476
    iops        : min= 7612, max= 8966, avg=8538.87, stdev=67.43,
samples=476
   lat (nsec)   : 750=0.06%, 1000=14.94%
   lat (usec)   : 2=59.18%, 4=23.07%, 10=1.28%, 20=1.17%, 50=0.21%
   lat (usec)   : 100=0.08%, 250=0.01%, 500=0.01%
   lat (msec)   : 2=0.01%
   cpu          : usr=1.04%, sys=5.50%, ctx=537639, majf=0, minf=36
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%,
32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
64=0.0%, >=64=0.0%
      issued rwts: total=512316,0,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
    READ: bw=33.4MiB/s (35.0MB/s), 33.4MiB/s-33.4MiB/s
(35.0MB/s-35.0MB/s), io=2001MiB (2098MB), run=60001-60001msec

Results are worse than yours, but this is on a production (not
very busy) pool with 4x3.84TB SATA disks (4 disks total vs ~15
in your case) and a 10G network.

The VM CPU type is x86-64-v3 and the host CPU is a Ryzen 1700.

I get almost the same IOPS with --iodepth=16.

I tried moving the VM to a Ryzen 5900X and the results are somewhat
better:

# fio --name=registry-read --ioengine=libaio --rw=randread --bs=4k
--numjobs=4 --size=1G --runtime=60 --group_reporting --iodepth=16
registry-read: (g=0): rw=randread, bs=(R) 4096B-4096B, (W)
4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.33
Starting 4 processes
Jobs: 4 (f=4): [r(4)][100.0%][r=45.4MiB/s][r=11.6k IOPS][eta 00m:00s]
registry-read: (groupid=0, jobs=4): err= 0: pid=24282: Thu Mar 20
13:18:23 2025
   read: IOPS=11.6k, BW=45.5MiB/s (47.7MB/s)(2730MiB/60001msec)
     slat (usec): min=110, max=21206, avg=341.21, stdev=79.69
     clat (nsec): min=1390, max=42395k, avg=5147009.08,
stdev=475506.40
      lat (usec): min=335, max=42779, avg=5488.22, stdev=498.03
     clat percentiles (usec):
      |  1.00th=[ 4621],  5.00th=[ 4752], 10.00th=[ 4817],
20.00th=[ 4948],
      | 30.00th=[ 5014], 40.00th=[ 5080], 50.00th=[ 5080],
60.00th=[ 5145],
      | 70.00th=[ 5211], 80.00th=[ 5276], 90.00th=[ 5407],
95.00th=[ 5538],
      | 99.00th=[ 6194], 99.50th=[ 6783], 99.90th=[ 9765],
99.95th=[12125],
      | 99.99th=[24249]
    bw (  KiB/s): min=36434, max=48352, per=100.00%, avg=46612.18,
stdev=300.09, samples=476
    iops        : min= 9108, max=12088, avg=11653.04, stdev=75.03,
samples=476
   lat (usec)   : 2=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
   lat (msec)   : 2=0.01%, 4=0.01%, 10=99.90%, 20=0.08%, 50=0.01%
   cpu          : usr=0.98%, sys=4.18%, ctx=706399, majf=0, minf=99
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%,
32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%,
64=0.0%, >=64=0.0%
      issued rwts: total=698956,0,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
    READ: bw=45.5MiB/s (47.7MB/s), 45.5MiB/s-45.5MiB/s
(47.7MB/s-47.7MB/s), io=2730MiB (2863MB), run=60001-60001msec

I think we're limited by the IO thread. I suggest you try multiple
disks with the VirtIO SCSI single controller.
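For example, adding a second test disk on its own IO thread could look
roughly like this (VM id, storage and size below are only examples), and
then run fio against both disks in parallel:

# qm set 112 --scsi1 proxmox_r3_ssd2:32,discard=on,iothread=1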

My VM conf:
agent: 1
boot: order=scsi0;ide2;net0
cores: 2
cpu: x86-64-v3
ide2: none,media=cdrom
memory: 2048
meta: creation-qemu=9.0.2,ctime=1739888364
name: elacunza-btrfs-test
net0: virtio=BC:24:11:47:9B:58,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: proxmox_r3_ssd2:vm-112-disk-0,discard=on,iothread=1,size=15G
scsihw: virtio-scsi-single
smbios1: uuid=263ab229-4379-4abf-b6bf-615b98ccd3d4
sockets: 1
vmgenid: 13b7f2a4-2a42-4600-845a-da88f96ae6e8

I think this is a KVM/QEMU issue, not a Ceph issue :) Maybe you
can get better suggestions on the pve-user mailing list.

Cheers

On 20/3/25 at 12:29, Giovanna Ratini wrote:
Hello Eneko,

This is my configuration. The performance is similar across all
VMs. I am now checking GitLab, as that is where people are
complaining the most.

agent: 1
balloon: 65000
bios: ovmf
boot: order=scsi0;net0
cores: 10
cpu: host
efidisk0: cephvm:vm-6506-disk-0,efitype=4m,size=528K
memory: 130000
meta: creation-qemu=9.0.2,ctime=1734995123
name: gitlab02
net0: virtio=BC:24:11:6E:28:71,bridge=vmbr1,firewall=1
numa: 0
ostype: l26
scsi0: cephvm:vm-6506-disk-1,aio=native,cache=writeback,iothread=1,size=64G,ssd=1
scsi1: cephvm:vm-6506-disk-2,aio=native,cache=writeback,iothread=1,size=10T,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=0a5294c0-c82a-40f2-aae4-f5880022a2ac
sockets: 2
vmgenid: ea610fde-6c71-4b7f-9257-fa431a428e16

Cheers,

Gio

On 20.03.2025 at 10:23, Eneko Lacunza wrote:
Hi Giovanna,

Can you post the VM's full config?

Also, can you test with IO thread enabled, the VirtIO SCSI single
controller, and multiple disks?

Cheers

On 19/3/25 at 17:27, Giovanna Ratini wrote:
Hello Eneko,

Yes, I did. No significant changes. :-(
Cheers,

Gio


On Wednesday, March 19, 2025, 13:09 CET, Eneko Lacunza
<elacunza@xxxxxxxxx> wrote:

Hi Giovanna,

Have you tried increasing the iothreads option for the VM?

Cheers

On 18/3/25 at 19:13, Giovanna Ratini wrote:
Hello Anthony,

No, there is no QoS applied to the VMs.

The server has PCIe Gen 4.

ceph osd dump | grep pool
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0
object_hash
rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 21
flags
hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1
application mgr
read_balance_score 13.04
pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
last_change 598 lfor 0/598/596 flags hashpspool stripe_width 0
application cephfs read_balance_score 2.02
pool 3 'cephfs_metadata' replicated size 3 min_size 2
crush_rule 0
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
last_change 50 flags hashpspool stripe_width 0
pg_autoscale_bias 4
pg_num_min 16 recovery_priority 5 application cephfs
read_balance_score 2.42
pool 4 'cephvm' replicated size 3 min_size 2 crush_rule 0
object_hash
rjenkins pg_num 128 pgp_num 128 autoscale_mode on
last_change 16386
lfor 0/644/2603 flags hashpspool,selfmanaged_snaps
stripe_width 0
application rbd read_balance_score 1.52

I think this is the default config. 🙈

I will look for a firmware update for my Supermicro chassis.

Thank you


On 18.03.2025 at 17:57, Anthony D'Atri wrote:
Then I tested on the *Proxmox host*, and the results were
significantly better.
My Proxmox prowess is limited, but from my experience with
other
virtualization platforms, I have to ask if there is any QoS
throttling applied to VMs.  With OpenStack or DO there is
often IOPS
and/or throughput throttling via libvirt to mitigate noisy
neighbors.
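In Proxmox, any per-disk limits would show up in the VM config, so a quick
check along these lines should rule that out (the VM id is taken from the
config posted elsewhere in this thread):

# qm config 6506 | grep -Ei 'mbps|iops'

No matching output would mean no bandwidth or IOPS caps are set on the
virtual disks.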
   fio --name=host-test --filename=/dev/rbd0 --ioengine=libaio
--rw=randread --bs=4k --numjobs=4 --iodepth=32 --size=1G
--runtime=60 --group_reporting

IOPS: 1.54M
Bandwidth: 6032MiB/s (6325MB/s)
Latency: avg 39.8µs, 99.9th percentile 71µs
CPU usage: usr=22.60%, sys=77.13%

On 18.03.2025 at 15:27, Anthony D'Atri wrote:
Which NVMe drive SKUs specifically?
/dev/nvme6n1 – KCD61LUL15T3 – 15.36 TB – SN: 6250A02QT5A8
/dev/nvme5n1 – KCD61LUL15T3 – 15.36 TB – SN: 42R0A036T5A8
/dev/nvme4n1 – KCD61LUL15T3 – 15.36 TB – SN: 6250A02UT5A8
Kioxia CD6.  If you were using client-class drives, all manner of
performance issues would be expected.

Is your server chassis at least PCIe Gen 4? If it’s Gen 3
that may
hamper these drives.

Also, how many of these are in your cluster? If it’s a
small number
you might still benefit from chopping each into at least 2
separate
OSDs.
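(For reference, ceph-volume can do the split at OSD-creation time, e.g.
something like the line below, with the device path only as an example;
the existing OSD on that device would of course have to be drained and
destroyed first.)

# ceph-volume lvm batch --osds-per-device 2 /dev/nvme6n1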

And please send `ceph osd dump | grep pool`; having too few PGs
wouldn't do you any favors.


Are you running a recent kernel?
penultimate: 6.8.12-8-pve (VM, yes)
Groovy.  If you were running, say, a CentOS 6 or CentOS 7 kernel, then
NVMe issues might be expected, as old kernels had rudimentary NVMe
support.

   Have you updated firmware on the NVMe devices?
No.
Kioxia appears not to release firmware updates publicly, but your
chassis brand (Dell, HP, SMCI, etc.) might have an update.

e.g.
https://www.dell.com/support/home/en-vc/drivers/driversdetails?driverid=7ny55

   If there is an available update I would strongly suggest applying it.
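Checking the current firmware revision first is quick, e.g. with nvme-cli:

# nvme list

The FW Rev column there is what the vendor's release notes will refer to.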
Thanks again,

best regards,
Gio

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



