Hi All,

We had a fiber cut tonight between 2 data centers, and a Ceph cluster didn't do very well :( We ended up with 98% of PGs down.

This setup has 2 data centers defined, with 4 copies spread across both and a min_size of 1. We have 1 mon/mgr in each DC, plus one more in a 3rd data center that is connected to each of the other 2 by VPN.

When I ran a pg query on the PGs that were stuck, it said they were blocked from coming up because they couldn't contact 2 of the OSDs (the ones in the data center we could no longer reach)... but the other 2 OSDs were fine. I'm at a loss, because this is exactly the outcome we thought we had set things up to prevent, and with size = 4 and min_size = 1 my understanding was that it would keep going without a problem? :(

The crush map is below... if anyone has any ideas, I would sincerely appreciate it :)

Thanks!
Dale
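For completeness, this is roughly what I was running while poking at it. The commands are from memory, and <pool> / <pgid> are just placeholders for our pool name and one of the stuck PGs:

    ceph osd pool get <pool> size        # returns 4 for us
    ceph osd pool get <pool> min_size    # returns 1 for us
    ceph health detail
    ceph pg dump_stuck inactive
    ceph pg <pgid> query                 # the "recovery_state" section shows what it is blocked on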
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd
device 18 osd.18 class ssd
device 19 osd.19 class ssd
device 20 osd.20 class ssd
device 21 osd.21 class ssd
device 22 osd.22 class ssd
device 23 osd.23 class ssd
device 24 osd.24 class ssd
device 25 osd.25 class ssd
device 26 osd.26 class ssd
device 27 osd.27 class ssd
device 28 osd.28 class ssd
device 29 osd.29 class ssd
device 30 osd.30 class ssd
device 31 osd.31 class ssd
device 32 osd.32 class ssd
device 33 osd.33 class ssd
device 34 osd.34 class ssd
device 35 osd.35 class ssd
device 36 osd.36 class ssd
device 37 osd.37 class ssd
device 38 osd.38 class ssd
device 39 osd.39 class ssd
device 40 osd.40 class ssd
device 41 osd.41 class ssd
device 42 osd.42 class ssd
device 43 osd.43 class ssd
device 44 osd.44 class ssd
device 45 osd.45 class ssd
device 46 osd.46 class ssd
device 47 osd.47 class ssd
device 49 osd.49 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host Pnode01 {
	id -8		# do not change unnecessarily
	id -9 class ssd		# do not change unnecessarily
	# weight 0.000
	alg straw2
	hash 0	# rjenkins1
}
host node01 {
	id -2		# do not change unnecessarily
	id -15 class ssd		# do not change unnecessarily
	# weight 14.537
	alg straw2
	hash 0	# rjenkins1
	item osd.4 weight 1.817
	item osd.1 weight 1.817
	item osd.3 weight 1.817
	item osd.2 weight 1.817
	item osd.6 weight 1.817
	item osd.9 weight 1.817
	item osd.5 weight 1.817
	item osd.0 weight 1.818
}
host node02 {
	id -3		# do not change unnecessarily
	id -16 class ssd		# do not change unnecessarily
	# weight 14.536
	alg straw2
	hash 0	# rjenkins1
	item osd.10 weight 1.817
	item osd.11 weight 1.817
	item osd.12 weight 1.817
	item osd.13 weight 1.817
	item osd.14 weight 1.817
	item osd.15 weight 1.817
	item osd.16 weight 1.817
	item osd.19 weight 1.817
}
host node03 {
	id -4		# do not change unnecessarily
	id -17 class ssd		# do not change unnecessarily
	# weight 14.536
	alg straw2
	hash 0	# rjenkins1
	item osd.20 weight 1.817
	item osd.21 weight 1.817
	item osd.22 weight 1.817
	item osd.23 weight 1.817
	item osd.25 weight 1.817
	item osd.26 weight 1.817
	item osd.29 weight 1.817
	item osd.24 weight 1.817
}
datacenter EDM1 {
	id -11		# do not change unnecessarily
	id -14 class ssd		# do not change unnecessarily
	# weight 43.609
	alg straw
	hash 0	# rjenkins1
	item node01 weight 14.537
	item node02 weight 14.536
	item node03 weight 14.536
}
host node04 {
	id -5		# do not change unnecessarily
	id -18 class ssd		# do not change unnecessarily
	# weight 14.536
	alg straw2
	hash 0	# rjenkins1
	item osd.30 weight 1.817
	item osd.31 weight 1.817
	item osd.32 weight 1.817
	item osd.33 weight 1.817
	item osd.34 weight 1.817
	item osd.35 weight 1.817
	item osd.36 weight 1.817
	item osd.39 weight 1.817
}
host node05 {
	id -6		# do not change unnecessarily
	id -19 class ssd		# do not change unnecessarily
	# weight 14.536
	alg straw2
	hash 0	# rjenkins1
	item osd.40 weight 1.817
	item osd.41 weight 1.817
	item osd.42 weight 1.817
	item osd.43 weight 1.817
	item osd.44 weight 1.817
	item osd.45 weight 1.817
	item osd.46 weight 1.817
	item osd.49 weight 1.817
}
host node06 {
	id -7		# do not change unnecessarily
	id -20 class ssd		# do not change unnecessarily
	# weight 16.353
	alg straw2
	hash 0	# rjenkins1
	item osd.47 weight 1.817
	item osd.37 weight 1.817
	item osd.27 weight 1.817
	item osd.38 weight 1.817
	item osd.7 weight 1.817
	item osd.28 weight 1.817
	item osd.8 weight 1.817
	item osd.17 weight 1.817
	item osd.18 weight 1.817
}
datacenter EDM3 {
	id -12		# do not change unnecessarily
	id -13 class ssd		# do not change unnecessarily
	# weight 45.425
	alg straw
	hash 0	# rjenkins1
	item node04 weight 14.536
	item node05 weight 14.536
	item node06 weight 16.353
}
datacenter EDM2 {
	id -10		# do not change unnecessarily
	id -22 class ssd		# do not change unnecessarily
	# weight 0.000
	alg straw
	hash 0	# rjenkins1
}
root default {
	id -1		# do not change unnecessarily
	id -21 class ssd		# do not change unnecessarily
	# weight 89.034
	alg straw2
	hash 0	# rjenkins1
	item Pnode01 weight 0.000
	item EDM1 weight 43.609
	item EDM3 weight 45.425
	item EDM2 weight 0.000
}

# rules
rule replicated_ruleset {
	id 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step choose firstn 2 type datacenter
	step chooseleaf firstn 2 type host
	step emit
}
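In case it helps anyone reproduce this, the map above can be pulled and exercised offline with crushtool; the file names below are just placeholders:

    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt    # decompile; this is what is pasted above
    crushtool -i crush.bin --test --rule 0 --num-rep 4 --show-mappings

If I understand the tool correctly, re-running that --test line with --weight <osd-id> 0 added for each OSD in the unreachable DC should roughly simulate the cut and show what the rule still maps with only one datacenter up.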