On Mon, May 29, 2017 at 4:58 AM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
>
> Hello all,
>
> We have a ceph cluster with 72 OSDs distributed on 6 hosts, in 3 chassis. In
> our crush map we are distributing the PGs on chassis (complete crush map
> below):
>
> # rules
> rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type chassis
>         step emit
> }
>
> We had a host failure, and I can see that ceph is using 2 OSDs from the same
> chassis for a lot of the remapped PGs. Even worse, I can see that there are
> cases when a PG is using two OSDs from the same host, like here:
>
> 3.5f6   37   0   4   37   0   149446656   3040   3040   active+remapped
> 2017-05-26 11:29:23.122820   61820'222074   61820:158025   [52,39]   52
> [52,39,3]   52   61488'198356   2017-05-23 23:51:56.210597   61488'198356
> 2017-05-23 23:51:56.210597
>
> I have this in the log:
> 2017-05-26 11:26:53.244424 osd.52 10.12.193.69:6801/7044 1510 : cluster
> [INF] 3.5f6 restarting backfill on osd.39 from (0'0,0'0] MAX to 61488'203000
>
> What can be wrong?

It's not clear from the output you've provided whether your pools have size 2
or 3. From what you've shown, I'm guessing you have size 2, and the OSD
failure prompted a move of the PG in question away from OSD 3 to OSD 39.
Since OSD 39 doesn't have any of the data yet, OSD 3 is being kept in the
acting set to preserve redundancy (that's the [52,39,3] next to the up set
[52,39] in your pg dump line), but it will go away once the backfill is done.

In general, it's a failure of CRUSH's design goals if you see replicas move
within buckets which didn't experience a failure, but it does sometimes
happen. There have been a lot of improvements over the years to reduce how
often that happens, some of which are supported by Hammer but not on by
default (because they prevent use of older clients), and some of which are
only in very new code like the Luminous dev releases. I suspect you'd find
things behave better on your cluster if you upgrade to Jewel and set the
CRUSH tunables it recommends to you.
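You can double-check the pool's replication settings and watch the backfill
from the command line; something like this should do it (pool 3 is whatever
name "ceph osd lspools" reports for id 3 -- I'm writing <pool> as a
placeholder):

    ceph osd lspools                    # find the name of pool id 3
    ceph osd pool get <pool> size       # number of replicas
    ceph osd pool get <pool> min_size   # replicas needed to keep serving I/O
    ceph pg 3.5f6 query                 # shows "up" vs "acting"; osd.3 should drop
                                        # out of "acting" once backfill to osd.39 finishes

To see which tunables the cluster is running with now, and to switch to the
recommended profile once you've upgraded (note this moves data around and
needs clients that understand the newer tunables):

    ceph osd crush show-tunables
    ceph osd crush tunables optimal

You can also sanity-check the map you pasted below offline, without touching
the cluster -- roughly like this (double-check the flags against the crushtool
man page for your version):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings
    crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings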
-Greg

>
>
> Our crush map looks like this:
>
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable straw_calc_version 1
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> ....
> device 69 osd.69
> device 70 osd.70
> device 71 osd.71
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host tv-c1-al01 {
>         id -7           # do not change unnecessarily
>         # weight 21.840
>         alg straw
>         hash 0  # rjenkins1
>         item osd.5 weight 1.820
>         item osd.11 weight 1.820
>         item osd.17 weight 1.820
>         item osd.23 weight 1.820
>         item osd.29 weight 1.820
>         item osd.35 weight 1.820
>         item osd.41 weight 1.820
>         item osd.47 weight 1.820
>         item osd.53 weight 1.820
>         item osd.59 weight 1.820
>         item osd.65 weight 1.820
>         item osd.71 weight 1.820
> }
> host tv-c1-al02 {
>         id -3           # do not change unnecessarily
>         # weight 21.840
>         alg straw
>         hash 0  # rjenkins1
>         item osd.1 weight 1.820
>         item osd.7 weight 1.820
>         item osd.13 weight 1.820
>         item osd.19 weight 1.820
>         item osd.25 weight 1.820
>         item osd.31 weight 1.820
>         item osd.37 weight 1.820
>         item osd.43 weight 1.820
>         item osd.49 weight 1.820
>         item osd.55 weight 1.820
>         item osd.61 weight 1.820
>         item osd.67 weight 1.820
> }
> chassis tv-c1 {
>         id -8           # do not change unnecessarily
>         # weight 43.680
>         alg straw
>         hash 0  # rjenkins1
>         item tv-c1-al01 weight 21.840
>         item tv-c1-al02 weight 21.840
> }
> host tv-c2-al01 {
>         id -5           # do not change unnecessarily
>         # weight 21.840
>         alg straw
>         hash 0  # rjenkins1
>         item osd.3 weight 1.820
>         item osd.9 weight 1.820
>         item osd.15 weight 1.820
>         item osd.21 weight 1.820
>         item osd.27 weight 1.820
>         item osd.33 weight 1.820
>         item osd.39 weight 1.820
>         item osd.45 weight 1.820
>         item osd.51 weight 1.820
>         item osd.57 weight 1.820
>         item osd.63 weight 1.820
>         item osd.70 weight 1.820
> }
> host tv-c2-al02 {
>         id -2           # do not change unnecessarily
>         # weight 21.840
>         alg straw
>         hash 0  # rjenkins1
>         item osd.0 weight 1.820
>         item osd.6 weight 1.820
>         item osd.12 weight 1.820
>         item osd.18 weight 1.820
>         item osd.24 weight 1.820
>         item osd.30 weight 1.820
>         item osd.36 weight 1.820
>         item osd.42 weight 1.820
>         item osd.48 weight 1.820
>         item osd.54 weight 1.820
>         item osd.60 weight 1.820
>         item osd.66 weight 1.820
> }
> chassis tv-c2 {
>         id -9           # do not change unnecessarily
>         # weight 43.680
>         alg straw
>         hash 0  # rjenkins1
>         item tv-c2-al01 weight 21.840
>         item tv-c2-al02 weight 21.840
> }
> host tv-c1-al03 {
>         id -6           # do not change unnecessarily
>         # weight 21.840
>         alg straw
>         hash 0  # rjenkins1
>         item osd.4 weight 1.820
>         item osd.10 weight 1.820
>         item osd.16 weight 1.820
>         item osd.22 weight 1.820
>         item osd.28 weight 1.820
>         item osd.34 weight 1.820
>         item osd.40 weight 1.820
>         item osd.46 weight 1.820
>         item osd.52 weight 1.820
>         item osd.58 weight 1.820
>         item osd.64 weight 1.820
>         item osd.69 weight 1.820
> }
> host tv-c2-al03 {
>         id -4           # do not change unnecessarily
>         # weight 21.840
>         alg straw
>         hash 0  # rjenkins1
>         item osd.2 weight 1.820
>         item osd.8 weight 1.820
>         item osd.14 weight 1.820
>         item osd.20 weight 1.820
>         item osd.26 weight 1.820
>         item osd.32 weight 1.820
>         item osd.38 weight 1.820
>         item osd.44 weight 1.820
>         item osd.50 weight 1.820
>         item osd.56 weight 1.820
>         item osd.62 weight 1.820
>         item osd.68 weight 1.820
> }
> chassis tv-c3 {
>         id -10          # do not change unnecessarily
>         # weight 43.680
>         alg straw
>         hash 0  # rjenkins1
>         item tv-c1-al03 weight 21.840
>         item tv-c2-al03 weight 21.840
> }
> root default {
>         id -1           # do not change unnecessarily
>         # weight 131.040
>         alg straw
>         hash 0  # rjenkins1
>         item tv-c1 weight 43.680
>         item tv-c2 weight 43.680
>         item tv-c3 weight 43.680
> }
>
> # rules
> rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type chassis
>         step emit
> }
>
> # end crush map
>
>
> Thank you,
> Laszlo
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com