Bad/strange performance on a new cluster

Hi all,

We have a new Ceph cluster that shows some very bad/strange performance behavior.

I really don't understand what I'm doing wrong here and would be more than happy if anyone has an idea.
Even a hint on what to look at would be helpful.

Some Information:

Machines (8 nodes), each with:

- CPU 2x Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz (28 Cores)
- 384 GB RAM
- 20x Dell Ent NVMe AGN RI U.2 7.68TB (for OSDs)
- 4x 25G LACP Backend
- 2x 25G LACP Frontend

- OS:
    - Ubuntu 22.04
    - Kernel: 5.15.0
- Ceph:
    - Version 18.2.4
    - 160 osds
    - 4096 PGs for the VM pool


I ran some of the fio benchmarks from the Proxmox Ceph performance paper:
https://www.proxmox.com/images/download/pve/docs/Proxmox-VE_Ceph-Benchmark-202009-rev2.pdf

The first test should reach about 1500 IOPS (the Proxmox paper reports 1806).
We only get about 170.

root@ceph001:/mnt# fio --ioengine=psync --filename=test_fio --size=9G --time_based --name=fio --group_reporting --runtime=60 --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.28
Starting 1 process
fio: Laying out IO file (1 file / 9216MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=680KiB/s][w=170 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=174797: Wed Jan 22 20:19:19 2025
  write: IOPS=202, BW=811KiB/s (831kB/s)(47.5MiB/60003msec); 0 zone resets
    clat (usec): min=2185, max=20081, avg=4925.43, stdev=931.63
     lat (usec): min=2186, max=20082, avg=4926.19, stdev=931.63
    clat percentiles (usec):
     |  1.00th=[ 3425],  5.00th=[ 3818], 10.00th=[ 3982], 20.00th=[ 4293],
     | 30.00th=[ 4490], 40.00th=[ 4686], 50.00th=[ 4817], 60.00th=[ 5014],
     | 70.00th=[ 5211], 80.00th=[ 5407], 90.00th=[ 5800], 95.00th=[ 6063],
     | 99.00th=[ 8586], 99.50th=[ 9503], 99.90th=[12256], 99.95th=[13304],
     | 99.99th=[19006]
bw ( KiB/s): min= 672, max= 1000, per=100.00%, avg=813.11, stdev=73.18, samples=119
   iops        : min=  168, max=  250, avg=203.28, stdev=18.29, samples=119
  lat (msec)   : 4=10.24%, 10=89.43%, 20=0.32%, 50=0.01%
  cpu          : usr=0.25%, sys=2.57%, ctx=36503, majf=0, minf=15
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,12167,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=811KiB/s (831kB/s), 811KiB/s-811KiB/s (831kB/s-831kB/s), io=47.5MiB (49.8MB), run=60003-60003msec

Disk stats (read/write):
  rbd0: ios=0/24296, merge=0/2, ticks=0/56351, in_queue=56351, util=99.97%
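
To rule out the drives themselves, the next step would be the same sync-write test against a single raw NVMe namespace, bypassing Ceph entirely. This is only a sketch (no results pasted here); /dev/nvmeXn1 is a placeholder, and the test destroys data on that device:

# WARNING: destructive, only run against an unused device
fio --ioengine=psync --filename=/dev/nvmeXn1 --runtime=60 --time_based \
    --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 \
    --name=raw-nvme-baseline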



Bandwidth and IOPS with a higher IO depth look OK to me:

fio --filename=/mnt/testingfio1 --size=50GB --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=256 --runtime=150 --numjobs=1 --time_based \
--group_reporting --name=iops-test-job --eta-newline=1

iops-test-job: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
fio-3.28
iops-test-job: (groupid=0, jobs=1): err= 0: pid=146931: Wed Jan 22 19:43:14 2025
  read: IOPS=20.0k, BW=78.0MiB/s (81.8MB/s)(11.4GiB/150006msec)
    slat (nsec): min=1245, max=7415.7k, avg=22636.97, stdev=224525.03

    clat (usec): min=238, max=32714, avg=5620.53, stdev=2255.97
     lat (usec): min=243, max=32721, avg=5643.32, stdev=2258.28
    clat percentiles (usec):
     |  1.00th=[ 1876],  5.00th=[ 2311], 10.00th=[ 2671], 20.00th=[ 3654],
     | 30.00th=[ 4146], 40.00th=[ 4752], 50.00th=[ 5342], 60.00th=[ 6128],
     | 70.00th=[ 6915], 80.00th=[ 7635], 90.00th=[ 8717], 95.00th=[ 9896],
     | 99.00th=[10683], 99.50th=[10945], 99.90th=[11863], 99.95th=[12649],
     | 99.99th=[14615]
   bw (  KiB/s): min=63254, max=98432, per=100.00%, avg=79914.04, stdev=6290.69, samples=299
   iops        : min=15813, max=24608, avg=19978.36, stdev=1572.70, samples=299
  write: IOPS=19.9k, BW=77.9MiB/s (81.7MB/s)(11.4GiB/150006msec); 0 zone resets
    slat (nsec): min=1349, max=8871.7k, avg=23250.80, stdev=225370.53
    clat (usec): min=629, max=81108, avg=7160.58, stdev=2338.33
     lat (usec): min=633, max=81114, avg=7183.98, stdev=2348.98
    clat percentiles (usec):
     |  1.00th=[ 2900],  5.00th=[ 3982], 10.00th=[ 4293], 20.00th=[ 5014],
     | 30.00th=[ 5735], 40.00th=[ 6325], 50.00th=[ 6980], 60.00th=[ 7570],
     | 70.00th=[ 8225], 80.00th=[ 9110], 90.00th=[10421], 95.00th=[11207],
     | 99.00th=[13435], 99.50th=[14353], 99.90th=[16581], 99.95th=[17957],
     | 99.99th=[21365]
   bw (  KiB/s): min=61755, max=98813, per=100.00%, avg=79877.64, stdev=6336.82, samples=299
   iops        : min=15438, max=24703, avg=19969.22, stdev=1584.22, samples=299
  lat (usec)   : 250=0.01%, 500=0.03%, 750=0.07%, 1000=0.08%
  lat (msec)   : 2=0.65%, 4=15.15%, 10=75.45%, 20=8.55%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=7.36%, sys=18.55%, ctx=155494, majf=0, minf=9200
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=2993949,2992429,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=78.0MiB/s (81.8MB/s), 78.0MiB/s-78.0MiB/s (81.8MB/s-81.8MB/s), io=11.4GiB (12.3GB), run=150006-150006msec
  WRITE: bw=77.9MiB/s (81.7MB/s), 77.9MiB/s-77.9MiB/s (81.7MB/s-81.7MB/s), io=11.4GiB (12.3GB), run=150006-150006msec

Disk stats (read/write):
rbd0: ios=2989470/2987987, merge=0/1, ticks=9760844/13043854, in_queue=22804699, util=100.00%
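
To take the filesystem and RBD layers out of the picture, a rados bench run against the same pool would be a useful comparison (a sketch, not run here; <pool> is a placeholder):

rados bench -p <pool> 60 write -b 4096 -t 1 --no-cleanup
rados -p <pool> cleanup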

We have 4096 PGs on the tested pool.
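
This can be confirmed with the following (where <pool> stands for the VM pool name):

ceph osd pool get <pool> pg_num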

root@ceph001:/mnt# ceph -s
  cluster:
    id:
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph001,ceph002,ceph003,ceph005,ceph006 (age 52m)
    mgr: ceph002.hgppdu(active, since 2d), standbys: ceph001.ooznoq
    osd: 160 osds: 160 up (since 5w), 160 in (since 5M)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    pools:   11 pools, 8449 pgs
    objects: 7.80M objects, 19 TiB
    usage:   56 TiB used, 1.0 PiB / 1.1 PiB avail


root@ceph001:~# ceph config get osd osd_memory_target
4294967296

root@ceph001:~# ceph config get osd
WHO MASK LEVEL OPTION VALUE ...
osd           advanced  osd_memory_target_autotune  true
...
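
Since autotune is enabled, the effective target can differ per OSD; it can be checked on a single daemon like this (osd.0 as an example):

ceph config show osd.0 osd_memory_target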

We would like to use the cluster with OpenStack Cinder, but the tests above were run directly on the cluster nodes against a mapped RBD image. The numbers from inside VMs are similar.
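
For completeness, the test image was set up roughly like this (pool/image names and the size are placeholders):

rbd create <pool>/fio-test --size 100G
rbd map <pool>/fio-test          # shows up as /dev/rbd0
mkfs.ext4 /dev/rbd0
mount /dev/rbd0 /mnt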


Thanks in advance.

Jan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


