Hi folks,

The company I recently joined has a Proxmox cluster of 4 hosts with a Ceph implementation that was set up using the Proxmox GUI. It is running terribly, and as a Ceph newbie I'm trying to figure out if the configuration is at fault. I'd really appreciate some help and guidance on this please.

The symptoms:

* Really slow read/write performance
* Really, really slow rebalancing/backfill
* High apply/commit latency on a couple of the SSDs when under load
* Knock-on performance hit on key VMs (particularly AD/DNS services) that affects user experience

The setup is as follows: 4 hosts, of which 3 are Dell R820s with 4-socket Xeons, 96 cores and 1.5 TB RAM each. The other (host 4) has a Ryzen 7 5800 processor with 64 GB RAM. All servers are on a simple 10GbE network with dedicated NICs on a separate subnet. The SSDs in use are a combination of new Seagate IronWolf 125 1TB SSDs and older Crucial MX500 1TB and WD Blue 1TB drives. I know some of these are consumer-class, and I'm working on replacing them. I believe the OSDs were added to Proxmox's Ceph implementation with the default settings, i.e. DB and WAL on the same OSD. All 4 hosts are set up as monitors, and the 3 beefy ones as managers and metadata servers. The Ceph version is 16.2.7.

Here is the config:

[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 192.168.8.4/24
        fsid = 4a4b4fff-d140-4e11-a35b-cbac0e18a3ce
        mon_allow_pool_delete = true
        mon_host = 192.168.8.4 192.168.8.6 192.168.8.5 192.168.8.3
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_memory_target = 2147483648
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 192.168.8.4/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
        keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.cl1-h1-lv]
        host = cl1-h1-lv
        mds_standby_for_name = pve

[mds.cl1-h2-lv]
        host = cl1-h2-lv
        mds_standby_for_name = pve

[mds.cl1-h3-lv]
        host = cl1-h3-lv
        mds_standby_for_name = pve

[mon.cl1-h1-lv]
        public_addr = 192.168.8.3

[mon.cl1-h2-lv]
        public_addr = 192.168.8.4

[mon.cl1-h3-lv]
        public_addr = 192.168.8.5

[mon.cl1-h4-lv]
        public_addr = 192.168.8.6

And the CRUSH map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host cl1-h2-lv {
        id -3           # do not change unnecessarily
        id -4 class ssd         # do not change unnecessarily
        # weight 2.729
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 0.910
        item osd.5 weight 0.910
        item osd.10 weight 0.910
}
host cl1-h3-lv {
        id -5           # do not change unnecessarily
        id -6 class ssd         # do not change unnecessarily
        # weight 2.729
        alg straw2
        hash 0  # rjenkins1
        item osd.1 weight 0.910
        item osd.6 weight 0.910
        item osd.11 weight 0.910
}
host cl1-h4-lv {
        id -7           # do not change unnecessarily
        id -8 class ssd         # do not change unnecessarily
        # weight 1.819
        alg straw2
        hash 0  # rjenkins1
        item osd.7 weight 0.910
        item osd.2 weight 0.910
}
host cl1-h1-lv {
        id -9           # do not change unnecessarily
        id -10 class ssd        # do not change unnecessarily
        # weight 3.639
        alg straw2
        hash 0  # rjenkins1
        item osd.4 weight 0.910
        item osd.8 weight 0.910
        item osd.9 weight 0.910
        item osd.12 weight 0.910
}
root default {
        id -1           # do not change unnecessarily
        id -2 class ssd         # do not change unnecessarily
        # weight 10.916
        alg straw2
        hash 0  # rjenkins1
        item cl1-h2-lv weight 2.729
        item cl1-h3-lv weight 2.729
        item cl1-h4-lv weight 1.819
        item cl1-h1-lv weight 3.639
}

# rules
rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map

Based on some reading, I'm starting to understand a little about what can be tweaked. For example, I think the osd_memory_target looks low (it's set to 2 GB here, and I believe the upstream default is 4 GB). I also think the DB/WAL should be on dedicated disks or partitions, but I'm not sure of the correct procedure for doing that.
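From the reading I've done so far, my best guess at the procedure for those two changes is below, but I haven't tested any of it, so please correct me if it's wrong. For the memory target, my understanding is that a value in ceph.conf overrides anything set with "ceph config set", so I'd edit /etc/pve/ceph.conf and then restart the OSDs on each host one at a time:

    # /etc/pve/ceph.conf, [global] section - raise to the 4 GiB default
    osd_memory_target = 4294967296

    # then on each host, for each OSD id, one at a time
    systemctl restart ceph-osd@<id>.service

For moving the DB/WAL to a dedicated device, the two options I've come across are re-creating each OSD through Proxmox with a separate DB device, or attaching a new DB LV to an existing OSD with ceph-volume. Both of these are untested guesses on my part, and the device names and IDs below are just placeholders:

    # Option A: mark the OSD out, let the data drain, stop it, then destroy
    # and re-create it via Proxmox with a separate DB device, waiting for
    # HEALTH_OK before moving on to the next one
    pveceph osd destroy <id> --cleanup
    pveceph osd create /dev/sd<X> --db_dev /dev/<fast-device>

    # Option B: attach a new DB logical volume to an existing OSD
    # (OSD stopped first; <vg>/<db-lv> is an LV created on the new DB disk)
    ceph-volume lvm new-db --osd-id <id> --osd-fsid <osd-fsid> --target <vg>/<db-lv>
    ceph-volume lvm migrate --osd-id <id> --osd-fsid <osd-fsid> --from data --target <vg>/<db-lv>

Does that look roughly right, or is there a better way?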
I'm actually thinking that the best bet would be to copy the VMs to temporary storage (as there is only about 7 TB's worth) and then set up Ceph from scratch following some kind of best-practice guide.

Anyway, any help would be gratefully received. Thanks for reading.

Kind regards,
Tino Todino
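P.S. I'm happy to gather more data if that helps. My plan was to collect some baseline numbers with the standard tools, roughly as follows (the pool name and device are placeholders):

    ceph osd perf          # per-OSD commit/apply latency
    ceph osd df tree       # utilisation and PG distribution per OSD/host
    rados bench -p <testpool> 30 write --no-cleanup
    rados bench -p <testpool> 30 rand
    rados -p <testpool> cleanup

and, on a spare/blank drive only (this writes to the raw device and is destructive), a 4k sync-write test to see how the consumer SSDs cope with the kind of I/O BlueStore generates:

    fio --name=synctest --filename=/dev/sd<X> --direct=1 --sync=1 \
        --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based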