Re: strange remap on host failure

Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> · Tue, 30 May 2017 22:57:43 +0300

I agree with you that the crush map is changing all the time because of the changes in the cluster. Our problem is that it did not change as expected in this host failure situation.

Kind regards,
Laszlo

On 30.05.2017 21:28, David Turner wrote:
Adding osds and nodes to a cluster changes the crush map, an osd being marked out changes the crush map, an osd being removed from the cluster changes the crush map... The crush map changes all the time even if you aren't modifying it directly.

On Tue, May 30, 2017 at 2:08 PM Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:

    We have not touched the crush map. We have only observed that the cluster is not responding as expected to a failure, and we wonder why. As I mentioned in the previous post, we were able to reproduce the situation on a different ceph cluster, so I've filed a bug report.

    So far this is what we have. Any ideas that would help to move forward with this situation are welcome.

    Kind regards,
    Laszlo

    On 30.05.2017 19:47, David Turner wrote:
     > When you lose a host, the entire CRUSH map is affected.  Any change to the crush map can affect any PG, OSD, host, or failure domain in the entire cluster.  If you modified osd.10's weight in the crush map by increasing it by 0.5, you would likely see PGs in the entire cluster moving around, not just PGs going onto and moving off of osd.10.  Would that match what you're seeing?
     >
     > You can do a full `ceph pg dump` and check whether any of the PGs show that they are running on multiple OSDs inside of the same failure domain.
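
The check described above can be sketched as follows: given each PG's OSD set and an OSD-to-failure-domain lookup, report the PGs where two OSDs share a domain. The function name and the hand-entered dictionaries are illustrative assumptions, not part of any Ceph tool. The sample values come from the PG line and crush map quoted further down in this thread (osd.52 sits under chassis tv-c3, while osd.39 and osd.3 are both on host tv-c2-al01 under chassis tv-c2).

```python
from collections import defaultdict


def find_domain_collisions(pg_sets, osd_to_domain):
    """Return {pgid: {domain: [osds]}} for PGs with more than one OSD in a domain."""
    collisions = {}
    for pgid, osds in pg_sets.items():
        by_domain = defaultdict(list)
        for osd in osds:
            by_domain[osd_to_domain[osd]].append(osd)
        # Keep only the domains that hold two or more OSDs of this PG.
        shared = {d: members for d, members in by_domain.items() if len(members) > 1}
        if shared:
            collisions[pgid] = shared
    return collisions


# Hand-entered sample (assumption: copied manually from this thread).
osd_to_chassis = {52: "tv-c3", 39: "tv-c2", 3: "tv-c2"}
acting_sets = {"3.5f6": [52, 39, 3]}

print(find_domain_collisions(acting_sets, osd_to_chassis))
# {'3.5f6': {'tv-c2': [39, 3]}}
```

Swapping the lookup for an OSD-to-host dictionary would run the same check at host level.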
     >
     > On Tue, May 30, 2017 at 12:34 PM Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
     >
     >     Hello David,
     >
     >     Thank you for your message.
     >
     >     Indeed we were expecting to see the PGs from the lost host redistributed to the surviving host from the same chassis (failure domain), but the reality is different :(
     >     I can see a lot of PGs being stuck active+undersized+degraded and active+remapped. And for some of the remapped PGs I can see OSDs from the same failure domain.
     >
     >     We were also able to reproduce the situation on a small test cluster, so I think this is a bug in hammer. The small test cluster reacted properly to the failure when running Jewel.
     >
     >     Kind regards,
     >     Laszlo
     >
     >     On 30.05.2017 17:31, David Turner wrote:
     >      > If you lose 1 of the hosts in a chassis, or a single drive, the pgs from that drive/host will be distributed to other drives in that chassis (because you only have 3 failure domains). That is to say that if you lose tv-c1-al01 then all of the pgs and data that were on that will be distributed to tv-c1-al02. The reason for that is that you only have 3 failure domains and replica size 3.
     >      >
     >      > If you lost both tv-c1-al01 and tv-c1-al02, then you would run with only 2 copies of your data until you brought up a third failure domain again. Ceph would never place 2 copies of your data inside of 1 failure domain.
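
The two cases above can be illustrated with a simple counting model. This is only a sketch of the "one copy per failure domain" constraint, not of the CRUSH algorithm itself; the function name and the dictionaries of surviving hosts per chassis are assumptions made for the example, with the chassis names taken from the crush map in this thread.

```python
def placeable_replicas(replica_size, hosts_up_per_domain):
    """Copies that can be placed when each failure domain holds at most one."""
    live_domains = sum(1 for hosts_up in hosts_up_per_domain.values() if hosts_up > 0)
    return min(replica_size, live_domains)


# Hosts still up in each chassis (two hosts per chassis when healthy).
healthy = {"tv-c1": 2, "tv-c2": 2, "tv-c3": 2}
one_host_lost = {"tv-c1": 1, "tv-c2": 2, "tv-c3": 2}
chassis_lost = {"tv-c1": 0, "tv-c2": 2, "tv-c3": 2}

for label, state in [
    ("healthy", healthy),
    ("one host lost", one_host_lost),
    ("whole chassis lost", chassis_lost),
]:
    print(label, placeable_replicas(3, state))
# healthy 3
# one host lost 3
# whole chassis lost 2
```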
     >      >
     >      > I recommend not running in production with fewer than N+2 failure domains, where N is your replica size. It allows for more efficient data redundancy and you can utilize a higher % of your total capacity. If you have 4 failure domains, the plan is to be able to survive losing 1 of them... which means you shouldn't use more than ~55% of your total capacity, because if you lose a node, that 55% of 4 nodes becomes 73% of 3 nodes. Few clusters are balanced well enough to handle 73% full without individual osds going above 80%.  3 failure domains can work if you replace failed storage quickly.
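
The 55% to 73% arithmetic above can be reproduced with a short calculation. This is a sketch that assumes equally sized failure domains and evenly spread data; both function names are made up for the example.

```python
def utilization_after_loss(used_fraction, domains, lost=1):
    """Fill level once the same data is squeezed onto the surviving domains."""
    if lost >= domains:
        raise ValueError("no surviving failure domains")
    return used_fraction * domains / (domains - lost)


def max_safe_utilization(target_fraction, domains, lost=1):
    """Highest pre-failure fill level that stays at or under the target after a loss."""
    if lost >= domains:
        raise ValueError("no surviving failure domains")
    return target_fraction * (domains - lost) / domains


print(round(utilization_after_loss(0.55, 4), 3))  # 0.733
print(round(max_safe_utilization(0.73, 4), 4))    # 0.5475
```

Under the same assumptions, the numbers show why roughly 55% is the ceiling for 4 domains if the cluster should stay near 73% after losing one of them.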
     >      >
     >      >
     >      > On Mon, May 29, 2017, 12:07 PM Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
     >      >
     >      >     Dear all,
     >      >
     >      >     How should ceph react in case of a host failure when 12 of a total of 72 OSDs are out?
     >      >     Is it normal that the remapping of the PGs does not follow the rule set in the crush map? (According to the rule, the OSDs should be selected from different chassis.)
     >      >
     >      >     in the attached file you can find the crush map, and the results of:
     >      >     ceph health detail
     >      >     ceph osd dump
     >      >     ceph osd tree
     >      >     ceph -s
     >      >
     >      >     I can send the pg dump in a separate mail on request. Its compressed size exceeds the size accepted by this mailing list.
     >      >
     >      >     Thank you for any help/directions.
     >      >
     >      >     Kind regards,
     >      >     Laszlo
     >      >
     >      >     On 29.05.2017 14:58, Laszlo Budai wrote:
     >      >      >
     >      >      > Hello all,
     >      >      >
     >      >      > We have a ceph cluster with 72 OSDs distributed on 6 hosts, in 3 chassis. In our crush map we are distributing the PGs across chassis (complete crush map below):
     >      >      >
     >      >      > # rules
     >      >      > rule replicated_ruleset {
     >      >      >          ruleset 0
     >      >      >          type replicated
     >      >      >          min_size 1
     >      >      >          max_size 10
     >      >      >          step take default
     >      >      >          step chooseleaf firstn 0 type chassis
     >      >      >          step emit
     >      >      > }
     >      >      >
     >      >      > We had a host failure, and I can see that ceph is using 2 OSDs from the same chassis for a lot of the remapped PGs. Even worse, I can see that there are cases when a PG is using two OSDs from the same host like here:
     >      >      >
     >      >      > 3.5f6   37      0       4       37      0       149446656       3040    3040    active+remapped 2017-05-26 11:29:23.122820      61820'222074    61820:158025    [52,39] 52      [52,39,3]       52      61488'198356    2017-05-23 23:51:56.210597      61488'198356    2017-05-23 23:51:56.210597
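
To compare the two OSD lists in a row like the one above, the bracketed sets can be pulled out with a small parser. This sketch assumes that the first bracketed list is the up set and the second is the acting set, which is how the plain-text `ceph pg dump` columns are read here; that column order is an assumption and should be checked against the header line of the actual dump. The function name is hypothetical.

```python
import re

# The PG row from this thread, with whitespace normalized.
PG_LINE = (
    "3.5f6 37 0 4 37 0 149446656 3040 3040 active+remapped "
    "2017-05-26 11:29:23.122820 61820'222074 61820:158025 "
    "[52,39] 52 [52,39,3] 52 61488'198356 2017-05-23 23:51:56.210597 "
    "61488'198356 2017-05-23 23:51:56.210597"
)


def parse_osd_sets(line):
    """Extract (pgid, up, acting) from one plain-text pg dump row."""
    pgid = line.split()[0]
    sets = re.findall(r"\[([\d,]*)\]", line)
    if len(sets) < 2:
        raise ValueError("expected two bracketed OSD lists")
    # Assumption: first list is the up set, second is the acting set.
    up, acting = ([int(x) for x in s.split(",") if x] for s in sets[:2])
    return pgid, up, acting


pgid, up, acting = parse_osd_sets(PG_LINE)
print(pgid, up, acting)
print("only in acting:", sorted(set(acting) - set(up)))
# 3.5f6 [52, 39] [52, 39, 3]
# only in acting: [3]
```

For this row the first list holds two OSDs and the second holds three; osd.3 appears only in the second list.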
     >      >      >
     >      >      > I have this in the log:
     >      >      > 2017-05-26 11:26:53.244424 osd.52 10.12.193.69:6801/7044 1510 : cluster [INF] 3.5f6 restarting backfill on osd.39 from (0'0,0'0] MAX to 61488'203000
     >      >      >
     >      >      >
     >      >      > What can be wrong?
     >      >      >
     >      >      >
     >      >      > Our crush map looks like this:
     >      >      >
     >      >      > # begin crush map
     >      >      > tunable choose_local_tries 0
     >      >      > tunable choose_local_fallback_tries 0
     >      >      > tunable choose_total_tries 50
     >      >      > tunable chooseleaf_descend_once 1
     >      >      > tunable straw_calc_version 1
     >      >      >
     >      >      > # devices
     >      >      > device 0 osd.0
     >      >      > device 1 osd.1
     >      >      > device 2 osd.2
     >      >      > device 3 osd.3
     >      >      > ....
     >      >      > device 69 osd.69
     >      >      > device 70 osd.70
     >      >      > device 71 osd.71
     >      >      >
     >      >      > # types
     >      >      > type 0 osd
     >      >      > type 1 host
     >      >      > type 2 chassis
     >      >      > type 3 rack
     >      >      > type 4 row
     >      >      > type 5 pdu
     >      >      > type 6 pod
     >      >      > type 7 room
     >      >      > type 8 datacenter
     >      >      > type 9 region
     >      >      > type 10 root
     >      >      >
     >      >      > # buckets
     >      >      > host tv-c1-al01 {
     >      >      >          id -7           # do not change unnecessarily
     >      >      >          # weight 21.840
     >      >      >          alg straw
     >      >      >          hash 0  # rjenkins1
     >      >      >          item osd.5 weight 1.820
     >      >      >          item osd.11 weight 1.820
     >      >      >          item osd.17 weight 1.820
     >      >      >          item osd.23 weight 1.820
     >      >      >          item osd.29 weight 1.820
     >      >      >          item osd.35 weight 1.820
     >      >      >          item osd.41 weight 1.820
     >      >      >          item osd.47 weight 1.820
     >      >      >          item osd.53 weight 1.820
     >      >      >          item osd.59 weight 1.820
     >      >      >          item osd.65 weight 1.820
     >      >      >          item osd.71 weight 1.820
     >      >      > }
     >      >      > host tv-c1-al02 {
     >      >      >          id -3           # do not change unnecessarily
     >      >      >          # weight 21.840
     >      >      >          alg straw
     >      >      >          hash 0  # rjenkins1
     >      >      >          item osd.1 weight 1.820
     >      >      >          item osd.7 weight 1.820
     >      >      >          item osd.13 weight 1.820
     >      >      >          item osd.19 weight 1.820
     >      >      >          item osd.25 weight 1.820
     >      >      >          item osd.31 weight 1.820
     >      >      >          item osd.37 weight 1.820
     >      >      >          item osd.43 weight 1.820
     >      >      >          item osd.49 weight 1.820
     >      >      >          item osd.55 weight 1.820
     >      >      >          item osd.61 weight 1.820
     >      >      >          item osd.67 weight 1.820
     >      >      > }
     >      >      > chassis tv-c1 {
     >      >      >          id -8           # do not change unnecessarily
     >      >      >          # weight 43.680
     >      >      >          alg straw
     >      >      >          hash 0  # rjenkins1
     >      >      >          item tv-c1-al01 weight 21.840
     >      >      >          item tv-c1-al02 weight 21.840
     >      >      > }
     >      >      > host tv-c2-al01 {
     >      >      >          id -5           # do not change unnecessarily
     >      >      >          # weight 21.840
     >      >      >          alg straw
     >      >      >          hash 0  # rjenkins1
     >      >      >          item osd.3 weight 1.820
     >      >      >          item osd.9 weight 1.820
     >      >      >          item osd.15 weight 1.820
     >      >      >          item osd.21 weight 1.820
     >      >      >          item osd.27 weight 1.820
     >      >      >          item osd.33 weight 1.820
     >      >      >          item osd.39 weight 1.820
     >      >      >          item osd.45 weight 1.820
     >      >      >          item osd.51 weight 1.820
     >      >      >          item osd.57 weight 1.820
     >      >      >          item osd.63 weight 1.820
     >      >      >          item osd.70 weight 1.820
     >      >      > }
     >      >      > host tv-c2-al02 {
     >      >      >          id -2           # do not change unnecessarily
     >      >      >          # weight 21.840
     >      >      >          alg straw
     >      >      >          hash 0  # rjenkins1
     >      >      >          item osd.0 weight 1.820
     >      >      >          item osd.6 weight 1.820
     >      >      >          item osd.12 weight 1.820
     >      >      >          item osd.18 weight 1.820
     >      >      >          item osd.24 weight 1.820
     >      >      >          item osd.30 weight 1.820
     >      >      >          item osd.36 weight 1.820
     >      >      >          item osd.42 weight 1.820
     >      >      >          item osd.48 weight 1.820
     >      >      >          item osd.54 weight 1.820
     >      >      >          item osd.60 weight 1.820
     >      >      >          item osd.66 weight 1.820
     >      >      > }
     >      >      > chassis tv-c2 {
     >      >      >          id -9           # do not change unnecessarily
     >      >      >          # weight 43.680
     >      >      >          alg straw
     >      >      >          hash 0  # rjenkins1
     >      >      >          item tv-c2-al01 weight 21.840
     >      >      >          item tv-c2-al02 weight 21.840
     >      >      > }
     >      >      > host tv-c1-al03 {
     >      >      >          id -6           # do not change unnecessarily
     >      >      >          # weight 21.840
     >      >      >          alg straw
     >      >      >          hash 0  # rjenkins1
     >      >      >          item osd.4 weight 1.820
     >      >      >          item osd.10 weight 1.820
     >      >      >          item osd.16 weight 1.820
     >      >      >          item osd.22 weight 1.820
     >      >      >          item osd.28 weight 1.820
     >      >      >          item osd.34 weight 1.820
     >      >      >          item osd.40 weight 1.820
     >      >      >          item osd.46 weight 1.820
     >      >      >          item osd.52 weight 1.820
     >      >      >          item osd.58 weight 1.820
     >      >      >          item osd.64 weight 1.820
     >      >      >          item osd.69 weight 1.820
     >      >      > }
     >      >      > host tv-c2-al03 {
     >      >      >          id -4           # do not change unnecessarily
     >      >      >          # weight 21.840
     >      >      >          alg straw
     >      >      >          hash 0  # rjenkins1
     >      >      >          item osd.2 weight 1.820
     >      >      >          item osd.8 weight 1.820
     >      >      >          item osd.14 weight 1.820
     >      >      >          item osd.20 weight 1.820
     >      >      >          item osd.26 weight 1.820
     >      >      >          item osd.32 weight 1.820
     >      >      >          item osd.38 weight 1.820
     >      >      >          item osd.44 weight 1.820
     >      >      >          item osd.50 weight 1.820
     >      >      >          item osd.56 weight 1.820
     >      >      >          item osd.62 weight 1.820
     >      >      >          item osd.68 weight 1.820
     >      >      > }
     >      >      > chassis tv-c3 {
     >      >      >          id -10          # do not change unnecessarily
     >      >      >          # weight 43.680
     >      >      >          alg straw
     >      >      >          hash 0  # rjenkins1
     >      >      >          item tv-c1-al03 weight 21.840
     >      >      >          item tv-c2-al03 weight 21.840
     >      >      > }
     >      >      > root default {
     >      >      >          id -1           # do not change unnecessarily
     >      >      >          # weight 131.040
     >      >      >          alg straw
     >      >      >          hash 0  # rjenkins1
     >      >      >          item tv-c1 weight 43.680
     >      >      >          item tv-c2 weight 43.680
     >      >      >          item tv-c3 weight 43.680
     >      >      > }
     >      >      >
     >      >      > # rules
     >      >      > rule replicated_ruleset {
     >      >      >          ruleset 0
     >      >      >          type replicated
     >      >      >          min_size 1
     >      >      >          max_size 10
     >      >      >          step take default
     >      >      >          step chooseleaf firstn 0 type chassis
     >      >      >          step emit
     >      >      > }
     >      >      >
     >      >      > # end crush map
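
To look up which host and chassis an OSD belongs to without reading the map by hand, the bucket section of a decompiled crush map can be parsed into a child-to-parent table. This is a minimal sketch: the function names are hypothetical, the input is assumed to be the plain decompiled text (without the mail quoting prefixes), and only the `item` lines inside bucket blocks are interpreted. The sample below is a shortened excerpt of the map above, not the full map.

```python
import re

BUCKET_RE = re.compile(r"^\s*(\w+)\s+([\w.-]+)\s*\{")
ITEM_RE = re.compile(r"^\s*item\s+([\w.-]+)")


def parse_buckets(crush_text):
    """Return ({bucket: type}, {child: parent}) from decompiled crush map text."""
    bucket_type, parent = {}, {}
    current = None
    for raw in crush_text.splitlines():
        line = raw.split("#", 1)[0]  # drop comments
        m = BUCKET_RE.match(line)
        if m and m.group(1) != "rule":  # rule blocks are not buckets
            current = m.group(2)
            bucket_type[current] = m.group(1)
            continue
        if line.strip() == "}":
            current = None
            continue
        m = ITEM_RE.match(line)
        if m and current:
            parent[m.group(1)] = current
    return bucket_type, parent


def ancestor_of_type(name, wanted, bucket_type, parent):
    """Walk up from `name` until a bucket of type `wanted` is found."""
    while name in parent:
        name = parent[name]
        if bucket_type.get(name) == wanted:
            return name
    return None


# Shortened excerpt of the crush map in this thread.
SAMPLE = """
host tv-c2-al01 {
        id -5           # do not change unnecessarily
        alg straw
        hash 0  # rjenkins1
        item osd.3 weight 1.820
        item osd.39 weight 1.820
}
host tv-c1-al03 {
        id -6
        item osd.52 weight 1.820
}
chassis tv-c2 {
        id -9
        item tv-c2-al01 weight 21.840
}
chassis tv-c3 {
        id -10
        item tv-c1-al03 weight 21.840
}
"""

types, parents = parse_buckets(SAMPLE)
for osd in ("osd.52", "osd.39", "osd.3"):
    print(osd,
          ancestor_of_type(osd, "host", types, parents),
          ancestor_of_type(osd, "chassis", types, parents))
# osd.52 tv-c1-al03 tv-c3
# osd.39 tv-c2-al01 tv-c2
# osd.3 tv-c2-al01 tv-c2
```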
     >      >      >
     >      >      >
     >      >      > Thank you,
     >      >      > Laszlo
     >      >      > _______________________________________________
     >      >      > ceph-users mailing list
     >      >      > ceph-users@xxxxxxxxxxxxxx
     >      >      > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
     >      >      >
     >      >
     >
