Hi Janne,

I've changed some elements of the config now and the results are much better, but still quite poor relative to what I would consider normal SSD performance.

The osd_memory_target is now set to 12 GB for 3 of the 4 hosts (each of these hosts has 1.5 TB of RAM, so I can allocate plenty more if necessary). The fourth host is a more modern small-form-factor server with only 64 GB on board, so each of the 3 OSDs in that machine gets 4 GB.
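As an aside, since the ceph.conf below only carries the global 12 GB value: one way to express the per-host split without editing ceph.conf is a centralised config mask. A rough sketch only (the host name is illustrative, as is which host gets the smaller value):

  # default of 12 GiB for every OSD
  ceph config set osd osd_memory_target 12884901888
  # override to 4 GiB per OSD on the 64 GB small-form-factor host
  ceph config set osd/host:cl1-h4-lv osd_memory_target 4294967296
  # check what a given OSD actually resolves to
  ceph config get osd.3 osd_memory_target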
The number of PGs has been increased from 128 to 256. I have not yet run the JJ Balancer.

In terms of performance, I measured the time it takes Proxmox to clone a 127 GB VM. It now clones in around 18 minutes, rather than the 1 hour 55 minutes it took before the config changes, so there is progress here.

I also had a play around with enabling and disabling the write cache, running a rudimentary "ceph tell osd.X bench" to see what the performance would be with it on and off. The results were surprising, as the disks provided far more IOPS with the cache ENABLED rather than disabled.
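For anyone who wants to repeat the comparison, the procedure I used was along these lines (device and OSD numbers are just examples, I toggled the cache with hdparm on plain SATA drives, and I make no claim this is a rigorous benchmark):

  # check and toggle the drive's volatile write cache
  smartctl -g wcache /dev/sda
  hdparm -W 0 /dev/sda   # cache off
  hdparm -W 1 /dev/sda   # cache on
  # then benchmark the OSD sitting on that drive
  # (defaults: writes 1 GiB in 4 MiB blocks)
  ceph tell osd.0 bench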
To answer your last question: we are on BlueStore, running Ceph v16.2.7.

I still think the next steps are to replace the remaining 6 consumer-grade devices with Seagate IronWolf 125 1TB SSDs, which seem to perform much better according to ceph benchmarks, and after that to increase the number of hosts to 6 and spread the 12 OSDs so that each host has only 2.

Any other suggestions are welcome. Many thanks.

Current ceph.conf:

[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 192.168.8.4/24
        fsid = 4a4b4fff-d140-4e11-a35b-cbac0e18a3ce
        mon_allow_pool_delete = true
        mon_host = 192.168.8.5 192.168.8.3 192.168.8.6
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_memory_target = 12884901888
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 192.168.8.4/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
        keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.cl1-h1-lv]
        host = cl1-h1-lv
        mds_standby_for_name = pve

[mds.cl1-h2-lv]
        host = cl1-h2-lv
        mds_standby_for_name = pve

[mds.cl1-h3-lv]
        host = cl1-h3-lv
        mds_standby_for_name = pve

[mon.cl1-h1-lv]
        public_addr = 192.168.8.3

[mon.cl1-h3-lv]
        public_addr = 192.168.8.5

[mon.cl1-h4-lv]
        public_addr = 192.168.8.6

And crush map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host cl1-h2-lv {
        id -3           # do not change unnecessarily
        id -4 class ssd # do not change unnecessarily
        # weight 2.729
        alg straw2
        hash 0          # rjenkins1
        item osd.0 weight 0.910
        item osd.5 weight 0.910
        item osd.10 weight 0.910
}
host cl1-h3-lv {
        id -5           # do not change unnecessarily
        id -6 class ssd # do not change unnecessarily
        # weight 2.729
        alg straw2
        hash 0          # rjenkins1
        item osd.1 weight 0.910
        item osd.6 weight 0.910
        item osd.11 weight 0.910
}
host cl1-h4-lv {
        id -7           # do not change unnecessarily
        id -8 class ssd # do not change unnecessarily
        # weight 2.729
        alg straw2
        hash 0          # rjenkins1
        item osd.7 weight 0.910
        item osd.2 weight 0.910
        item osd.3 weight 0.910
}
host cl1-h1-lv {
        id -9           # do not change unnecessarily
        id -10 class ssd        # do not change unnecessarily
        # weight 2.729
        alg straw2
        hash 0          # rjenkins1
        item osd.4 weight 0.910
        item osd.9 weight 0.910
        item osd.12 weight 0.910
}
root default {
        id -1           # do not change unnecessarily
        id -2 class ssd # do not change unnecessarily
        # weight 10.916
        alg straw2
        hash 0          # rjenkins1
        item cl1-h2-lv weight 2.729
        item cl1-h3-lv weight 2.729
        item cl1-h4-lv weight 2.729
        item cl1-h1-lv weight 2.729
}

# rules
rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map

Ceph -s output:

root@cl1-h1-lv:~# ceph -s
  cluster:
    id:     4a4b4fff-d140-4e11-a35b-cbac0e18a3ce
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cl1-h3-lv,cl1-h1-lv,cl1-h4-lv (age 3d)
    mgr: cl1-h3-lv(active, since 11w), standbys: cl1-h2-lv, cl1-h1-lv
    mds: 1/1 daemons up, 2 standby
    osd: 12 osds: 12 up (since 3d), 12 in (since 3d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 305 pgs
    objects: 647.02k objects, 2.4 TiB
    usage:   7.2 TiB used, 3.7 TiB / 11 TiB avail
    pgs:     305 active+clean

  io:
    client:   96 KiB/s rd, 409 KiB/s wr, 7 op/s rd, 38 op/s wr

ceph osd df output:

root@cl1-h1-lv:~# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP      META     AVAIL    %USE   VAR   PGS  STATUS
 4  ssd    0.90970  1.00000   932 GiB  635 GiB  632 GiB  1.1 MiB   2.5 GiB  297 GiB  68.12  1.03  79   up
 9  ssd    0.90970  1.00000   932 GiB  643 GiB  640 GiB  62 MiB    2.1 GiB  289 GiB  68.98  1.05  81   up
12  ssd    0.90970  1.00000   932 GiB  576 GiB  574 GiB  1007 KiB  2.1 GiB  355 GiB  61.87  0.94  70   up
 0  ssd    0.90970  1.00000   932 GiB  643 GiB  641 GiB  1.1 MiB   2.2 GiB  288 GiB  69.05  1.05  80   up
 5  ssd    0.90970  1.00000   932 GiB  595 GiB  593 GiB  1.0 MiB   2.5 GiB  336 GiB  63.91  0.97  70   up
10  ssd    0.90970  1.00000   932 GiB  585 GiB  583 GiB  1.6 MiB   2.4 GiB  346 GiB  62.82  0.95  74   up
 1  ssd    0.90970  1.00000   932 GiB  597 GiB  595 GiB  1.0 MiB   2.2 GiB  334 GiB  64.10  0.97  69   up
 6  ssd    0.90970  1.00000   932 GiB  652 GiB  649 GiB  62 MiB    2.4 GiB  280 GiB  69.94  1.06  85   up
11  ssd    0.90970  1.00000   932 GiB  587 GiB  584 GiB  1016 KiB  2.5 GiB  345 GiB  62.98  0.95  72   up
 2  ssd    0.90970  1.00000   932 GiB  605 GiB  603 GiB  62 MiB    2.1 GiB  326 GiB  64.96  0.98  79   up
 3  ssd    0.90970  1.00000   932 GiB  645 GiB  643 GiB  1.1 MiB   1.9 GiB  287 GiB  69.23  1.05  82   up
 7  ssd    0.90970  1.00000   932 GiB  615 GiB  612 GiB  1.2 MiB   2.6 GiB  317 GiB  65.99  1.00  74   up
                    TOTAL     11 TiB   7.2 TiB  7.2 TiB  196 MiB   28 GiB   3.7 TiB  66.00
MIN/MAX VAR: 0.94/1.06  STDDEV: 2.80

-----Original Message-----
From: Janne Johansson <icepic.dz@xxxxxxxxx>
Sent: 10 October 2022 07:52
To: Tino Todino <tinot@xxxxxxxxxxxxxxxxx>
Cc: ceph-users@xxxxxxx
Subject: Re: Inherited CEPH nightmare

> osd_memory_target = 2147483648
>
> Based on some reading, I'm starting to understand a little about what can be tweaked. For example, I think the osd_memory_target looks low. I also think the DB/WAL should be on dedicated disks or partitions, but have no idea what procedure to follow to do this. I'm actually thinking that the best bet would be to copy the VMs to temporary storage (as there is only about 7 TB's worth) and then set up Ceph from scratch following some kind of best-practice guide.

Yes, the memory target is very low. If you have RAM to spare, bumping this to 4-6-8-10G per OSD should give some speedups.

If you can, check one of each drive type to see if they gain or lose from having the write cache turned off, as per https://medium.com/coccoc-engineering-blog/performance-impact-of-write-cache-for-hard-solid-state-disk-drives-755d01fcce61 and other guides. The ceph usage pattern, combined with some less-than-optimal SSD caches, sometimes forces much more data to be flushed when ceph wants to make sure a small write actually hits the disk, meaning you get poor IOPS rates. Unfortunately this is very dependent on the controllers and the drives, so there is no simple rule for whether on or off is "best" across all possible combinations, but the fio test shown on that page and similar ones should quickly tell you whether you can get 50-100% more write IOPS out of your drives by having the cache in the right mode for each type of disk. Hopefully the bumped RAM will help with read performance, so you should be able to get better perf from two relatively simple changes.

Check whether all OSDs are BlueStore, and if not, convert each FileStore OSD to BlueStore; that would probably give you 50% more write IOPS on that OSD. https://www.virtualtothecore.com/how-to-migrate-ceph-storage-volumes-from-filestore-to-bluestore/ They probably are BlueStore, but it can't hurt to check if the cluster is old.

--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx