Hi Janne,

I've changed some elements of the config now and the results are much better, but still quite poor relative to what I would consider normal SSD performance.

The osd_memory_target is now set to 12 GB for 3 of the 4 hosts (each of these hosts has 1.5 TB of RAM, so I can allocate plenty more if necessary). The fourth host is a more modern small-form-factor server with only 64 GB on board, so each of the 3 OSDs in that machine gets 4 GB.
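As an aside, since the ceph.conf below only carries the global 12 GB value: one way to express the per-host split without editing ceph.conf is a centralised config mask. A rough sketch only (the host name is illustrative, as is which host gets the smaller value):

  # default of 12 GiB for every OSD
  ceph config set osd osd_memory_target 12884901888
  # override to 4 GiB per OSD on the 64 GB small-form-factor host
  ceph config set osd/host:cl1-h4-lv osd_memory_target 4294967296
  # check what a given OSD actually resolves to
  ceph config get osd.3 osd_memory_target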
The number of PGs has been increased from 128 to 256. I have not yet run the JJ Balancer.

In terms of performance, I measured the time it takes Proxmox to clone a 127 GB VM. It now clones in around 18 minutes, rather than the 1 hour 55 minutes it took before the config changes, so there is progress here.

I also had a play around with enabling and disabling the write cache, running a rudimentary "ceph tell osd.X bench" to see what the performance would be with it on and off. The results were surprising, as the disks provided far more IOPS with the cache ENABLED rather than disabled.
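For anyone who wants to repeat the comparison, the procedure I used was along these lines (device and OSD numbers are just examples, I toggled the cache with hdparm on plain SATA drives, and I make no claim this is a rigorous benchmark):

  # check and toggle the drive's volatile write cache
  smartctl -g wcache /dev/sda
  hdparm -W 0 /dev/sda   # cache off
  hdparm -W 1 /dev/sda   # cache on
  # then benchmark the OSD sitting on that drive
  # (defaults: writes 1 GiB in 4 MiB blocks)
  ceph tell osd.0 bench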
To answer your last question: we are on BlueStore, running Ceph v16.2.7.

I still think the next steps are to replace the remaining 6 consumer-grade devices with Seagate IronWolf 125 1TB SSDs, which seem to perform much better according to ceph benchmarks, and after that to increase the number of hosts to 6 and spread the 12 OSDs so that each host has only 2.

Any other suggestions are welcome. Many thanks.

Current ceph.conf:

[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 192.168.8.4/24
        fsid = 4a4b4fff-d140-4e11-a35b-cbac0e18a3ce
        mon_allow_pool_delete = true
        mon_host = 192.168.8.5 192.168.8.3 192.168.8.6
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_memory_target = 12884901888
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 192.168.8.4/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
        keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.cl1-h1-lv]
        host = cl1-h1-lv
        mds_standby_for_name = pve

[mds.cl1-h2-lv]
        host = cl1-h2-lv
        mds_standby_for_name = pve

[mds.cl1-h3-lv]
        host = cl1-h3-lv
        mds_standby_for_name = pve

[mon.cl1-h1-lv]
        public_addr = 192.168.8.3

[mon.cl1-h3-lv]
        public_addr = 192.168.8.5

[mon.cl1-h4-lv]
        public_addr = 192.168.8.6

And crush map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host cl1-h2-lv {
        id -3           # do not change unnecessarily
        id -4 class ssd # do not change unnecessarily
        # weight 2.729
        alg straw2
        hash 0          # rjenkins1
        item osd.0 weight 0.910
        item osd.5 weight 0.910
        item osd.10 weight 0.910
}
host cl1-h3-lv {
        id -5           # do not change unnecessarily
        id -6 class ssd # do not change unnecessarily
        # weight 2.729
        alg straw2
        hash 0          # rjenkins1
        item osd.1 weight 0.910
        item osd.6 weight 0.910
        item osd.11 weight 0.910
}
host cl1-h4-lv {
        id -7           # do not change unnecessarily
        id -8 class ssd # do not change unnecessarily
        # weight 2.729
        alg straw2
        hash 0          # rjenkins1
        item osd.7 weight 0.910
        item osd.2 weight 0.910
        item osd.3 weight 0.910
}
host cl1-h1-lv {
        id -9           # do not change unnecessarily
        id -10 class ssd        # do not change unnecessarily
        # weight 2.729
        alg straw2
        hash 0          # rjenkins1
        item osd.4 weight 0.910
        item osd.9 weight 0.910
        item osd.12 weight 0.910
}
root default {
        id -1           # do not change unnecessarily
        id -2 class ssd # do not change unnecessarily
        # weight 10.916
        alg straw2
        hash 0          # rjenkins1
        item cl1-h2-lv weight 2.729
        item cl1-h3-lv weight 2.729
        item cl1-h4-lv weight 2.729
        item cl1-h1-lv weight 2.729
}

# rules
rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map

Ceph -s output:

root@cl1-h1-lv:~# ceph -s
  cluster:
    id:     4a4b4fff-d140-4e11-a35b-cbac0e18a3ce
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cl1-h3-lv,cl1-h1-lv,cl1-h4-lv (age 3d)
    mgr: cl1-h3-lv(active, since 11w), standbys: cl1-h2-lv, cl1-h1-lv
    mds: 1/1 daemons up, 2 standby
    osd: 12 osds: 12 up (since 3d), 12 in (since 3d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 305 pgs
    objects: 647.02k objects, 2.4 TiB
    usage:   7.2 TiB used, 3.7 TiB / 11 TiB avail
    pgs:     305 active+clean

  io:
    client:   96 KiB/s rd, 409 KiB/s wr, 7 op/s rd, 38 op/s wr

ceph osd df output:

root@cl1-h1-lv:~# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP      META     AVAIL    %USE   VAR   PGS  STATUS
 4  ssd    0.90970  1.00000   932 GiB  635 GiB  632 GiB  1.1 MiB   2.5 GiB  297 GiB  68.12  1.03  79   up
 9  ssd    0.90970  1.00000   932 GiB  643 GiB  640 GiB  62 MiB    2.1 GiB  289 GiB  68.98  1.05  81   up
12  ssd    0.90970  1.00000   932 GiB  576 GiB  574 GiB  1007 KiB  2.1 GiB  355 GiB  61.87  0.94  70   up
 0  ssd    0.90970  1.00000   932 GiB  643 GiB  641 GiB  1.1 MiB   2.2 GiB  288 GiB  69.05  1.05  80   up
 5  ssd    0.90970  1.00000   932 GiB  595 GiB  593 GiB  1.0 MiB   2.5 GiB  336 GiB  63.91  0.97  70   up
10  ssd    0.90970  1.00000   932 GiB  585 GiB  583 GiB  1.6 MiB   2.4 GiB  346 GiB  62.82  0.95  74   up
 1  ssd    0.90970  1.00000   932 GiB  597 GiB  595 GiB  1.0 MiB   2.2 GiB  334 GiB  64.10  0.97  69   up
 6  ssd    0.90970  1.00000   932 GiB  652 GiB  649 GiB  62 MiB    2.4 GiB  280 GiB  69.94  1.06  85   up
11  ssd    0.90970  1.00000   932 GiB  587 GiB  584 GiB  1016 KiB  2.5 GiB  345 GiB  62.98  0.95  72   up
 2  ssd    0.90970  1.00000   932 GiB  605 GiB  603 GiB  62 MiB    2.1 GiB  326 GiB  64.96  0.98  79   up
 3  ssd    0.90970  1.00000   932 GiB  645 GiB  643 GiB  1.1 MiB   1.9 GiB  287 GiB  69.23  1.05  82   up
 7  ssd    0.90970  1.00000   932 GiB  615 GiB  612 GiB  1.2 MiB   2.6 GiB  317 GiB  65.99  1.00  74   up
                    TOTAL     11 TiB   7.2 TiB  7.2 TiB  196 MiB   28 GiB   3.7 TiB  66.00
MIN/MAX VAR: 0.94/1.06  STDDEV: 2.80

-----Original Message-----
From: Janne Johansson <icepic.dz@xxxxxxxxx>
Sent: 10 October 2022 07:52
To: Tino Todino <tinot@xxxxxxxxxxxxxxxxx>
Cc: ceph-users@xxxxxxx
Subject: Re: Inherited CEPH nightmare

> osd_memory_target = 2147483648
>
> Based on some reading, I'm starting to understand a little about what can be tweaked. For example, I think the osd_memory_target looks low. I also think the DB/WAL should be on dedicated disks or partitions, but have no idea what procedure to follow to do this. I'm actually thinking that the best bet would be to copy the VMs to temporary storage (as there is only about 7 TB's worth) and then set up Ceph from scratch following some kind of best-practice guide.

Yes, the memory target is very low. If you have RAM to spare, bumping this to 4-6-8-10G per OSD should give some speedups.

If you can, check one of each drive type to see if they gain or lose from having the write cache turned off, as per https://medium.com/coccoc-engineering-blog/performance-impact-of-write-cache-for-hard-solid-state-disk-drives-755d01fcce61 and other guides. The ceph usage pattern, combined with some less-than-optimal SSD caches, sometimes forces much more data to be flushed when ceph wants to make sure a small write actually hits the disk, meaning you get poor IOPS rates. Unfortunately this is very dependent on the controllers and the drives, so there is no simple rule for whether on or off is "best" across all possible combinations, but the fio test shown on that page and similar ones should quickly tell you whether you can get 50-100% more write IOPS out of your drives by having the cache in the right mode for each type of disk. Hopefully the bumped RAM will help with read performance, so you should be able to get better perf from two relatively simple changes.

Check whether all OSDs are BlueStore, and if not, convert each FileStore OSD to BlueStore; that would probably give you 50% more write IOPS on that OSD. https://www.virtualtothecore.com/how-to-migrate-ceph-storage-volumes-from-filestore-to-bluestore/ They probably are BlueStore, but it can't hurt to check if the cluster is old.

--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx