Re: strange remap on host failure

Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> · Wed, 31 May 2017 06:45:57 +0300

Hello Greg!

Thank you for the answer.

Our pools have their size set to 3:

tv-dl360-1:~$ ceph osd pool ls detail
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 1 'images' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 61453 flags hashpspool stripe_width 0
        removed_snaps [1~9,c~c2,cf~280]
pool 2 'instances' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 61455 flags hashpspool stripe_width 0
        removed_snaps [1~3]
pool 3 'volumes' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 61457 flags hashpspool stripe_width 0
        removed_snaps [1~13d]
pool 4 '.rgw.root' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 266 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 5 '.rgw.control' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 270 flags hashpspool stripe_width 0
pool 6 '.rgw' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 272 flags hashpspool stripe_width 0
pool 7 '.rgw.gc' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 272 flags hashpspool stripe_width 0
pool 8 '.users.uid' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 279 flags hashpspool stripe_width 0
pool 9 '.rgw.buckets.index' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 285 flags hashpspool stripe_width 0
pool 10 '.rgw.buckets' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 2846 flags hashpspool stripe_width 0

tv-dl360-1:~$

I know it's not good practice to have min_size set to 1, but this is how I found it. It will be fixed once the cluster is back in a healthy state. Anyway, the issue is present on all the pools.
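
To list the pools that still need the min_size fix, here is a minimal Python sketch that scans `ceph osd pool ls detail` output. The sample text is two pools copied from the listing above; the function name and the threshold of 2 are illustrative assumptions, not part of any Ceph tooling.

```python
import re

# Two pools copied from the `ceph osd pool ls detail` output above.
POOL_DETAIL = """\
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 1 'images' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 61453 flags hashpspool stripe_width 0
        removed_snaps [1~9,c~c2,cf~280]
"""

# Matches only the pool summary lines; indented removed_snaps lines are skipped.
POOL_RE = re.compile(r"^pool (\d+) '([^']+)' replicated size (\d+) min_size (\d+)")


def pools_with_low_min_size(text, threshold=2):
    """Return (name, size, min_size) for pools whose min_size is below threshold."""
    flagged = []
    for line in text.splitlines():
        match = POOL_RE.match(line)
        if not match:
            continue
        _, name, size, min_size = match.groups()
        if int(min_size) < threshold:
            flagged.append((name, int(size), int(min_size)))
    return flagged


print(pools_with_low_min_size(POOL_DETAIL))
```

Run against the full listing, this would flag 'images', 'instances' and 'volumes', the three pools shown with min_size 1.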

Can you tell me more about the features that are supported by Hammer but not enabled by default?

Yesterday we were able to reproduce the issue on a test cluster. Hammer behaved the same way, but Jewel worked properly.
Upgrading to Jewel is planned, but we have not yet decided when it will happen.

Thank you,
Laszlo

On 30.05.2017 23:17, Gregory Farnum wrote:
On Mon, May 29, 2017 at 4:58 AM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:

Hello all,

We have a ceph cluster with 72 OSDs distributed on 6 hosts, in 3 chassis. In
our crush map we are distributing the PGs across chassis (complete crush map
below):

# rules
rule replicated_ruleset {
         ruleset 0
         type replicated
         min_size 1
         max_size 10
         step take default
         step chooseleaf firstn 0 type chassis
         step emit
}

We had a host failure, and I can see that ceph is using 2 OSDs from the same
chassis for a lot of the remapped PGs. Even worse, I can see that there are
cases when a PG is using two OSDs from the same host like here:

3.5f6   37      0       4       37      0       149446656       3040    3040    active+remapped 2017-05-26 11:29:23.122820      61820'222074    61820:158025    [52,39] 52      [52,39,3]       52      61488'198356    2017-05-23 23:51:56.210597      61488'198356    2017-05-23 23:51:56.210597

I have this in the log:
2017-05-26 11:26:53.244424 osd.52 10.12.193.69:6801/7044 1510 : cluster [INF] 3.5f6 restarting backfill on osd.39 from (0'0,0'0] MAX to 61488'203000

What can be wrong?

It's not clear from the output you've provided whether your pools have
size 2 or 3. From what you've shown, I'm guessing you have size 2, and
the OSD failure prompted a move of the PG in question away from OSD 3
to OSD 39. Since 39 doesn't have any of the data yet, OSD 3 is being
maintained in the acting set to maintain redundancy, but it will go
away once the backfill is done.
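
To make the difference between the up set and the acting set concrete, here is a minimal Python sketch using the two sets from the PG dump line quoted above. The function name and the pool_size parameter are illustrative and not part of any Ceph API; the size of 3 is taken from the pool listing in the reply at the top of this thread.

```python
def compare_up_and_acting(up, acting, pool_size):
    """Summarise how the up set (CRUSH's current choice) differs from the acting set."""
    up_set, acting_set = set(up), set(acting)
    return {
        # OSDs still serving the PG although CRUSH no longer selects them.
        "acting_only": sorted(acting_set - up_set),
        # OSDs CRUSH selected that are not yet in the acting set.
        "up_only": sorted(up_set - acting_set),
        # How many replicas CRUSH failed to place.
        "up_short_by": max(pool_size - len(up), 0),
    }


# Values for PG 3.5f6 from the dump above: up [52,39], acting [52,39,3].
summary = compare_up_and_acting(up=[52, 39], acting=[52, 39, 3], pool_size=3)
print(summary)
```

With these inputs OSD 3 appears only in the acting set, and the up set is one OSD short of the pool size, so CRUSH returned only two OSDs for this PG.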

In general, it's a failure of CRUSH's design goals if you see moves of
the replica within buckets which didn't experience failure, but they
do sometimes happen. There have been a lot of improvements over the
years to reduce how often that happens, some of which are supported by
Hammer but not on by default (because it prevents use of older
clients), some of which are only in very new code like the Luminous
dev releases. I suspect you'd find things behave better on your
cluster if you upgrade to Jewel and set the CRUSH flags it recommends
to you.
-Greg
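
As a rough way to see which CRUSH tunables a map does not set, here is a minimal Python sketch that compares the tunable lines from the crush map quoted further down against a reference set. The reference values are an assumption recalled from the Ceph CRUSH tunables documentation (chooseleaf_vary_r is the Firefly-era tunable), not something stated in this thread; check them against `ceph osd crush show-tunables` on your own version before acting on the result.

```python
# Tunable lines copied from the crush map in this thread.
CRUSH_MAP_HEADER = """\
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1
"""

# ASSUMPTION: reference values recalled from the Ceph documentation,
# not taken from this thread. Verify against your own release.
ASSUMED_REFERENCE = {
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 1,
}


def parse_tunables(text):
    """Collect 'tunable <name> <int>' lines into a dict."""
    tunables = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "tunable":
            tunables[parts[1]] = int(parts[2])
    return tunables


def differences(current, reference):
    """Return {name: (current value or None, reference value)} where they differ."""
    return {
        name: (current.get(name), wanted)
        for name, wanted in reference.items()
        if current.get(name) != wanted
    }


current = parse_tunables(CRUSH_MAP_HEADER)
print(differences(current, ASSUMED_REFERENCE))
```

A value of None means the map has no line for that tunable, which likely means it is still at its legacy default.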

Our crush map looks like this:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
....
device 69 osd.69
device 70 osd.70
device 71 osd.71

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host tv-c1-al01 {
         id -7           # do not change unnecessarily
         # weight 21.840
         alg straw
         hash 0  # rjenkins1
         item osd.5 weight 1.820
         item osd.11 weight 1.820
         item osd.17 weight 1.820
         item osd.23 weight 1.820
         item osd.29 weight 1.820
         item osd.35 weight 1.820
         item osd.41 weight 1.820
         item osd.47 weight 1.820
         item osd.53 weight 1.820
         item osd.59 weight 1.820
         item osd.65 weight 1.820
         item osd.71 weight 1.820
}
host tv-c1-al02 {
         id -3           # do not change unnecessarily
         # weight 21.840
         alg straw
         hash 0  # rjenkins1
         item osd.1 weight 1.820
         item osd.7 weight 1.820
         item osd.13 weight 1.820
         item osd.19 weight 1.820
         item osd.25 weight 1.820
         item osd.31 weight 1.820
         item osd.37 weight 1.820
         item osd.43 weight 1.820
         item osd.49 weight 1.820
         item osd.55 weight 1.820
         item osd.61 weight 1.820
         item osd.67 weight 1.820
}
chassis tv-c1 {
         id -8           # do not change unnecessarily
         # weight 43.680
         alg straw
         hash 0  # rjenkins1
         item tv-c1-al01 weight 21.840
         item tv-c1-al02 weight 21.840
}
host tv-c2-al01 {
         id -5           # do not change unnecessarily
         # weight 21.840
         alg straw
         hash 0  # rjenkins1
         item osd.3 weight 1.820
         item osd.9 weight 1.820
         item osd.15 weight 1.820
         item osd.21 weight 1.820
         item osd.27 weight 1.820
         item osd.33 weight 1.820
         item osd.39 weight 1.820
         item osd.45 weight 1.820
         item osd.51 weight 1.820
         item osd.57 weight 1.820
         item osd.63 weight 1.820
         item osd.70 weight 1.820
}
host tv-c2-al02 {
         id -2           # do not change unnecessarily
         # weight 21.840
         alg straw
         hash 0  # rjenkins1
         item osd.0 weight 1.820
         item osd.6 weight 1.820
         item osd.12 weight 1.820
         item osd.18 weight 1.820
         item osd.24 weight 1.820
         item osd.30 weight 1.820
         item osd.36 weight 1.820
         item osd.42 weight 1.820
         item osd.48 weight 1.820
         item osd.54 weight 1.820
         item osd.60 weight 1.820
         item osd.66 weight 1.820
}
chassis tv-c2 {
         id -9           # do not change unnecessarily
         # weight 43.680
         alg straw
         hash 0  # rjenkins1
         item tv-c2-al01 weight 21.840
         item tv-c2-al02 weight 21.840
}
host tv-c1-al03 {
         id -6           # do not change unnecessarily
         # weight 21.840
         alg straw
         hash 0  # rjenkins1
         item osd.4 weight 1.820
         item osd.10 weight 1.820
         item osd.16 weight 1.820
         item osd.22 weight 1.820
         item osd.28 weight 1.820
         item osd.34 weight 1.820
         item osd.40 weight 1.820
         item osd.46 weight 1.820
         item osd.52 weight 1.820
         item osd.58 weight 1.820
         item osd.64 weight 1.820
         item osd.69 weight 1.820
}
host tv-c2-al03 {
         id -4           # do not change unnecessarily
         # weight 21.840
         alg straw
         hash 0  # rjenkins1
         item osd.2 weight 1.820
         item osd.8 weight 1.820
         item osd.14 weight 1.820
         item osd.20 weight 1.820
         item osd.26 weight 1.820
         item osd.32 weight 1.820
         item osd.38 weight 1.820
         item osd.44 weight 1.820
         item osd.50 weight 1.820
         item osd.56 weight 1.820
         item osd.62 weight 1.820
         item osd.68 weight 1.820
}
chassis tv-c3 {
         id -10          # do not change unnecessarily
         # weight 43.680
         alg straw
         hash 0  # rjenkins1
         item tv-c1-al03 weight 21.840
         item tv-c2-al03 weight 21.840
}
root default {
         id -1           # do not change unnecessarily
         # weight 131.040
         alg straw
         hash 0  # rjenkins1
         item tv-c1 weight 43.680
         item tv-c2 weight 43.680
         item tv-c3 weight 43.680
}

# rules
rule replicated_ruleset {
         ruleset 0
         type replicated
         min_size 1
         max_size 10
         step take default
         step chooseleaf firstn 0 type chassis
         step emit
}

# end crush map
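
To check a PG's OSDs against the hierarchy above, here is a minimal Python sketch that parses bucket definitions and reports which OSDs share a host or chassis. The excerpt is trimmed from the crush map above to the buckets holding the OSDs of PG 3.5f6 (the id, alg and hash lines and the other OSDs are left out for brevity); the function names are illustrative and this is not how Ceph itself evaluates placement.

```python
import re
from collections import defaultdict

# Trimmed excerpt of the crush map above: only the OSDs of PG 3.5f6.
CRUSH_EXCERPT = """\
host tv-c2-al01 {
         item osd.3 weight 1.820
         item osd.39 weight 1.820
}
chassis tv-c2 {
         item tv-c2-al01 weight 21.840
         item tv-c2-al02 weight 21.840
}
host tv-c1-al03 {
         item osd.52 weight 1.820
}
chassis tv-c3 {
         item tv-c1-al03 weight 21.840
         item tv-c2-al03 weight 21.840
}
"""


def parse_parents(text):
    """Map each item name to its (parent bucket type, parent bucket name)."""
    parents = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#")[0].strip()  # drop trailing comments
        header = re.match(r"^(\w+) ([\w.-]+) \{$", line)
        if header:
            current = (header.group(1), header.group(2))
        elif line == "}":
            current = None
        elif current and line.startswith("item "):
            parents[line.split()[1]] = current
    return parents


def ancestor(parents, name, bucket_type):
    """Walk up from name until a bucket of bucket_type is found."""
    while name in parents:
        btype, bname = parents[name]
        if btype == bucket_type:
            return bname
        name = bname
    return None


def shared_domains(parents, osds, bucket_type):
    """Group OSD ids by bucket and keep buckets holding more than one."""
    groups = defaultdict(list)
    for osd in osds:
        groups[ancestor(parents, f"osd.{osd}", bucket_type)].append(osd)
    return {bucket: ids for bucket, ids in groups.items() if len(ids) > 1}


parents = parse_parents(CRUSH_EXCERPT)
acting = [52, 39, 3]  # acting set of PG 3.5f6
up = [52, 39]         # up set of PG 3.5f6
print(shared_domains(parents, acting, "host"))
print(shared_domains(parents, acting, "chassis"))
print(shared_domains(parents, up, "chassis"))
```

For this PG the acting set places OSDs 39 and 3 on the same host (tv-c2-al01), while the up set [52, 39] spans two chassis, so the overlap exists only in the acting set.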

Thank you,
Laszlo
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
