Hi Matthew,

To make a simplistic comparison: RAID 5 is generally not recommended
with large disks (>1 TB) because of the (low, but not zero) probability
of losing another disk during the rebuild. Now imagine losing a host
full of disks.

Additionally, min_size=1 means you can no longer maintain your cluster
(updates, etc.); it is dangerous. Unless you can afford to lose and
rebuild your cluster, you should never run with min_size < 2.
________________________________________________________

Kind regards,

*David CASIER*
________________________________________________________


On Tue, 5 Dec 2023 at 10:03, duluxoz <duluxoz@xxxxxxxxx> wrote:

> Thanks David, I knew I had something wrong :-)
>
> Just for my own edification: why is k=2, m=1 not recommended for
> production? Is it considered too "fragile", or is it something else?
>
> Cheers
>
> Dulux-Oz
>
> On 05/12/2023 19:53, David Rivera wrote:
> > The first problem here is that you are using crush-failure-domain=osd
> > when you should use crush-failure-domain=host. With three hosts, you
> > should use k=2, m=1; however, this is not recommended in a production
> > environment.
> >
> > On Mon, Dec 4, 2023, 23:26 duluxoz <duluxoz@xxxxxxxxx> wrote:
> >
> > Hi All,
> >
> > Looking for some help/explanation around erasure code pools, etc.
> >
> > I set up a 3-node Ceph (Quincy) cluster, with each box holding 7 OSDs
> > (HDDs) and each box running a Monitor, a Manager, and an iSCSI
> > Gateway. For the record, the cluster runs beautifully, without
> > resource issues, etc.
> >
> > I created an Erasure Code Profile, etc.:
> >
> > ~~~
> > ceph osd erasure-code-profile set my_ec_profile plugin=jerasure k=4 m=2 crush-failure-domain=osd
> > ceph osd crush rule create-erasure my_ec_rule my_ec_profile
> > ceph osd crush rule create-replicated my_replicated_rule default host
> > ~~~
> >
> > My Crush Map is:
> >
> > ~~~
> > # begin crush map
> > tunable choose_local_tries 0
> > tunable choose_local_fallback_tries 0
> > tunable choose_total_tries 50
> > tunable chooseleaf_descend_once 1
> > tunable chooseleaf_vary_r 1
> > tunable chooseleaf_stable 1
> > tunable straw_calc_version 1
> > tunable allowed_bucket_algs 54
> >
> > # devices
> > device 0 osd.0 class hdd
> > device 1 osd.1 class hdd
> > device 2 osd.2 class hdd
> > device 3 osd.3 class hdd
> > device 4 osd.4 class hdd
> > device 5 osd.5 class hdd
> > device 6 osd.6 class hdd
> > device 7 osd.7 class hdd
> > device 8 osd.8 class hdd
> > device 9 osd.9 class hdd
> > device 10 osd.10 class hdd
> > device 11 osd.11 class hdd
> > device 12 osd.12 class hdd
> > device 13 osd.13 class hdd
> > device 14 osd.14 class hdd
> > device 15 osd.15 class hdd
> > device 16 osd.16 class hdd
> > device 17 osd.17 class hdd
> > device 18 osd.18 class hdd
> > device 19 osd.19 class hdd
> > device 20 osd.20 class hdd
> >
> > # types
> > type 0 osd
> > type 1 host
> > type 2 chassis
> > type 3 rack
> > type 4 row
> > type 5 pdu
> > type 6 pod
> > type 7 room
> > type 8 datacenter
> > type 9 zone
> > type 10 region
> > type 11 root
> >
> > # buckets
> > host ceph_1 {
> >     id -3              # do not change unnecessarily
> >     id -4 class hdd    # do not change unnecessarily
> >     # weight 38.09564
> >     alg straw2
> >     hash 0             # rjenkins1
> >     item osd.0 weight 5.34769
> >     item osd.1 weight 5.45799
> >     item osd.2 weight 5.45799
> >     item osd.3 weight 5.45799
> >     item osd.4 weight 5.45799
> >     item osd.5 weight 5.45799
> >     item osd.6 weight 5.45799
> > }
> > host ceph_2 {
> >     id -5              # do not change unnecessarily
> >     id -6 class hdd    # do not change unnecessarily
> >     # weight 38.09564
> >     alg straw2
> >     hash 0             # rjenkins1
> >     item osd.7 weight 5.34769
> >     item osd.8 weight 5.45799
> >     item osd.9 weight 5.45799
> >     item osd.10 weight 5.45799
> >     item osd.11 weight 5.45799
> >     item osd.12 weight 5.45799
> >     item osd.13 weight 5.45799
> > }
> > host ceph_3 {
> >     id -7              # do not change unnecessarily
> >     id -8 class hdd    # do not change unnecessarily
> >     # weight 38.09564
> >     alg straw2
> >     hash 0             # rjenkins1
> >     item osd.14 weight 5.34769
> >     item osd.15 weight 5.45799
> >     item osd.16 weight 5.45799
> >     item osd.17 weight 5.45799
> >     item osd.18 weight 5.45799
> >     item osd.19 weight 5.45799
> >     item osd.20 weight 5.45799
> > }
> > root default {
> >     id -1              # do not change unnecessarily
> >     id -2 class hdd    # do not change unnecessarily
> >     # weight 114.28693
> >     alg straw2
> >     hash 0             # rjenkins1
> >     item ceph_1 weight 38.09564
> >     item ceph_2 weight 38.09564
> >     item ceph_3 weight 38.09564
> > }
> >
> > # rules
> > rule replicated_rule {
> >     id 0
> >     type replicated
> >     step take default
> >     step chooseleaf firstn 0 type host
> >     step emit
> > }
> > rule my_replicated_rule {
> >     id 1
> >     type replicated
> >     step take default
> >     step chooseleaf firstn 0 type host
> >     step emit
> > }
> > rule my_ec_rule {
> >     id 2
> >     type erasure
> >     step set_chooseleaf_tries 5
> >     step set_choose_tries 100
> >     step take default
> >     step choose indep 3 type host
> >     step chooseleaf indep 2 type osd
> >     step emit
> > }
> >
> > # end crush map
> > ~~~
> >
> > Finally, I created a pool:
> >
> > ~~~
> > ceph osd pool create my_pool 32 32 erasure my_ec_profile my_ec_rule
> > ceph osd pool application enable my_meta_pool rbd
> > rbd pool init my_meta_pool
> > rbd pool init my_pool
> > rbd create --size 16T my_pool/my_disk_1 --data-pool my_pool --image-feature journaling
> > ~~~
> >
> > All of this is so I can have some VMs (oVirt VMs, for the record)
> > with automatic failover in the case of a Ceph Node loss, i.e. I was
> > trying to "replicate" a 3-disk RAID 5 array across the Ceph Nodes,
> > so that I could lose a Node and still have a working set of VMs.
> >
> > However, I took one of the Ceph Nodes down (gracefully) for some
> > maintenance the other day and I lost *all* the VMs (i.e. oVirt
> > complained that there was no active pool). As soon as I brought the
> > down node back up, everything was good again.
> >
> > So my question is: what did I do wrong with my config?
> >
> > Should I, for example, change the EC Profile to `k=2, m=1`? But how
> > is that practically different from `k=4, m=2`? Yes, the latter
> > spreads the pool over more disks, but it should still only put 2
> > disks on each node, shouldn't it?
> >
> > Thanks in advance
> >
> > Cheers
> >
> > Dulux-Oz
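
To make the failure mode concrete: with crush-failure-domain=osd, the my_ec_rule above places two of each PG's six shards on every host, and on recent releases an erasure-coded pool typically gets min_size=k+1 (5 for k=4, m=2), so shutting one host down leaves only four shards per PG and I/O pauses, which would explain the "no active pool" symptom. Below is a minimal sketch of the host-based profile David Rivera suggests; the names my_ec_profile_host, my_ec_rule_host and my_pool_host are placeholders, and because k and m are fixed at pool creation, data would have to be migrated to a new pool rather than the existing pool being converted.

~~~
# Sketch only: the profile/rule/pool names below are placeholders.

# EC profile that puts every shard on a different host; with a host
# failure domain and only three hosts, k=2 m=1 is the largest layout
# that will map.
ceph osd erasure-code-profile set my_ec_profile_host plugin=jerasure \
    k=2 m=1 crush-failure-domain=host
ceph osd erasure-code-profile get my_ec_profile_host

# Matching CRUSH rule and a fresh pool (k/m cannot be changed on an
# existing pool).
ceph osd crush rule create-erasure my_ec_rule_host my_ec_profile_host
ceph osd pool create my_pool_host 32 32 erasure my_ec_profile_host my_ec_rule_host

# Check min_size: if it defaults to k+1 (here 3, the same as the pool
# size), a host outage will still pause I/O; lowering it to 2 keeps I/O
# flowing but leaves no redundancy margin while the host is down, which
# is the k=2, m=1 fragility discussed above.
ceph osd pool get my_pool_host min_size
ceph osd pool ls detail
~~~

In other words, with only three hosts an erasure-coded pool cannot both ride out a host outage and keep a spare shard during it, which is presumably why both replies treat 2+1 as workable but not recommended for production; a replicated size=3 pool on my_replicated_rule avoids that trade-off at the cost of raw capacity.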