Re: strange remap on host failure

It should also be noted that Hammer is pretty close to retirement and
is a poor choice for new clusters.

On Wed, May 31, 2017 at 6:17 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Mon, May 29, 2017 at 4:58 AM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
>>
>> Hello all,
>>
>> We have a ceph cluster with 72 OSDs distributed on 6 hosts, in 3 chassis. In
>> our crush map we are distributing the PGs across chassis (complete crush map
>> below):
>>
>> # rules
>> rule replicated_ruleset {
>>         ruleset 0
>>         type replicated
>>         min_size 1
>>         max_size 10
>>         step take default
>>         step chooseleaf firstn 0 type chassis
>>         step emit
>> }
>>
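A quick way to sanity-check what a rule like this actually maps to is to
pull the compiled map out of the cluster and run it through crushtool.
This is only a rough sketch (the file names are arbitrary):

    # grab the compiled map and keep a decompiled text copy for reference
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # simulate 3-replica placements for rule 0 and eyeball the OSD triples
    crushtool -i crushmap.bin --test --rule 0 --num-rep 3 \
        --min-x 0 --max-x 9 --show-mappings

With this rule every triple in the output should land in three different
chassis.
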
>> We had a host failure, and I can see that ceph is using 2 OSDs from the same
>> chassis for a lot of the remapped PGs. Even worse, I can see that there are
>> cases when a PG is using two OSDs from the same host like here:
>>
>> 3.5f6   37      0       4       37      0       149446656       3040    3040
>> active+remapped 2017-05-26 11:29:23.122820      61820'222074    61820:158025
>> [52,39] 52      [52,39,3]       52      61488'198356    2017-05-23
>> 23:51:56.210597      61488'198356    2017-05-23 23:51:56.210597
>>
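For what it's worth, in that pg dump line [52,39] is the up set and
[52,39,3] is the acting set. You can get the same information for a
single PG with something like:

    ceph pg map 3.5f6
    ceph pg 3.5f6 query    # recovery_state should show the backfill to osd.39
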
>> I have this in the log:
>> 2017-05-26 11:26:53.244424 osd.52 10.12.193.69:6801/7044 1510 : cluster
>> [INF] 3.5f6 restarting backfill on osd.39 from (0'0,0'0] MAX to 61488'203000
>>
>> What can be wrong?
>
> It's not clear from the output you've provided whether your pools have
> size 2 or 3. From what you've shown, I'm guessing you have size 2, and
> the OSD failure prompted a move of the PG in question away from OSD 3
> to OSD 39. Since 39 doesn't have any of the data yet, OSD 3 is being
> kept in the acting set to maintain redundancy, but it will go
> away once the backfill is done.
>
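To confirm the replication settings Greg is asking about (pool id 3,
judging by the PG name), something like this should do it; the size and
min_size fields in the matching line are what matter:

    ceph osd dump | grep "pool 3 "
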
> In general, it's a failure of CRUSH's design goals if you see moves of
> the replica within buckets which didn't experience failure, but they
> do sometimes happen. There have been a lot of improvements over the
> years to reduce how often that happens, some of which are supported by
> Hammer but not on by default (because it prevents use of older
> clients), some of which are only in very new code like the Luminous
> dev releases. I suspect you'd find things behave better on your
> cluster if you upgrade to Jewel and set the CRUSH flags it recommends
> to you.
> -Greg
>
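For the tunables Greg mentions: you can see what the cluster is
currently running with, and, after weighing the data movement and the
old-client lockout it causes, move to a newer profile. A rough sketch:

    ceph osd crush show-tunables

    # WARNING: changing the profile remaps data and raises the minimum
    # client feature set -- check client versions first
    ceph osd crush tunables firefly    # or "hammer" for straw2 buckets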
>>
>>
>> Our crush map looks like this:
>>
>> # begin crush map
>> tunable choose_local_tries 0
>> tunable choose_local_fallback_tries 0
>> tunable choose_total_tries 50
>> tunable chooseleaf_descend_once 1
>> tunable straw_calc_version 1
>>
>> # devices
>> device 0 osd.0
>> device 1 osd.1
>> device 2 osd.2
>> device 3 osd.3
>> ....
>> device 69 osd.69
>> device 70 osd.70
>> device 71 osd.71
>>
>> # types
>> type 0 osd
>> type 1 host
>> type 2 chassis
>> type 3 rack
>> type 4 row
>> type 5 pdu
>> type 6 pod
>> type 7 room
>> type 8 datacenter
>> type 9 region
>> type 10 root
>>
>> # buckets
>> host tv-c1-al01 {
>>         id -7           # do not change unnecessarily
>>         # weight 21.840
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd.5 weight 1.820
>>         item osd.11 weight 1.820
>>         item osd.17 weight 1.820
>>         item osd.23 weight 1.820
>>         item osd.29 weight 1.820
>>         item osd.35 weight 1.820
>>         item osd.41 weight 1.820
>>         item osd.47 weight 1.820
>>         item osd.53 weight 1.820
>>         item osd.59 weight 1.820
>>         item osd.65 weight 1.820
>>         item osd.71 weight 1.820
>> }
>> host tv-c1-al02 {
>>         id -3           # do not change unnecessarily
>>         # weight 21.840
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd.1 weight 1.820
>>         item osd.7 weight 1.820
>>         item osd.13 weight 1.820
>>         item osd.19 weight 1.820
>>         item osd.25 weight 1.820
>>         item osd.31 weight 1.820
>>         item osd.37 weight 1.820
>>         item osd.43 weight 1.820
>>         item osd.49 weight 1.820
>>         item osd.55 weight 1.820
>>         item osd.61 weight 1.820
>>         item osd.67 weight 1.820
>> }
>> chassis tv-c1 {
>>         id -8           # do not change unnecessarily
>>         # weight 43.680
>>         alg straw
>>         hash 0  # rjenkins1
>>         item tv-c1-al01 weight 21.840
>>         item tv-c1-al02 weight 21.840
>> }
>> host tv-c2-al01 {
>>         id -5           # do not change unnecessarily
>>         # weight 21.840
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd.3 weight 1.820
>>         item osd.9 weight 1.820
>>         item osd.15 weight 1.820
>>         item osd.21 weight 1.820
>>         item osd.27 weight 1.820
>>         item osd.33 weight 1.820
>>         item osd.39 weight 1.820
>>         item osd.45 weight 1.820
>>         item osd.51 weight 1.820
>>         item osd.57 weight 1.820
>>         item osd.63 weight 1.820
>>         item osd.70 weight 1.820
>> }
>> host tv-c2-al02 {
>>         id -2           # do not change unnecessarily
>>         # weight 21.840
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd.0 weight 1.820
>>         item osd.6 weight 1.820
>>         item osd.12 weight 1.820
>>         item osd.18 weight 1.820
>>         item osd.24 weight 1.820
>>         item osd.30 weight 1.820
>>         item osd.36 weight 1.820
>>         item osd.42 weight 1.820
>>         item osd.48 weight 1.820
>>         item osd.54 weight 1.820
>>         item osd.60 weight 1.820
>>         item osd.66 weight 1.820
>> }
>> chassis tv-c2 {
>>         id -9           # do not change unnecessarily
>>         # weight 43.680
>>         alg straw
>>         hash 0  # rjenkins1
>>         item tv-c2-al01 weight 21.840
>>         item tv-c2-al02 weight 21.840
>> }
>> host tv-c1-al03 {
>>         id -6           # do not change unnecessarily
>>         # weight 21.840
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd.4 weight 1.820
>>         item osd.10 weight 1.820
>>         item osd.16 weight 1.820
>>         item osd.22 weight 1.820
>>         item osd.28 weight 1.820
>>         item osd.34 weight 1.820
>>         item osd.40 weight 1.820
>>         item osd.46 weight 1.820
>>         item osd.52 weight 1.820
>>         item osd.58 weight 1.820
>>         item osd.64 weight 1.820
>>         item osd.69 weight 1.820
>> }
>> host tv-c2-al03 {
>>         id -4           # do not change unnecessarily
>>         # weight 21.840
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd.2 weight 1.820
>>         item osd.8 weight 1.820
>>         item osd.14 weight 1.820
>>         item osd.20 weight 1.820
>>         item osd.26 weight 1.820
>>         item osd.32 weight 1.820
>>         item osd.38 weight 1.820
>>         item osd.44 weight 1.820
>>         item osd.50 weight 1.820
>>         item osd.56 weight 1.820
>>         item osd.62 weight 1.820
>>         item osd.68 weight 1.820
>> }
>> chassis tv-c3 {
>>         id -10          # do not change unnecessarily
>>         # weight 43.680
>>         alg straw
>>         hash 0  # rjenkins1
>>         item tv-c1-al03 weight 21.840
>>         item tv-c2-al03 weight 21.840
>> }
>> root default {
>>         id -1           # do not change unnecessarily
>>         # weight 131.040
>>         alg straw
>>         hash 0  # rjenkins1
>>         item tv-c1 weight 43.680
>>         item tv-c2 weight 43.680
>>         item tv-c3 weight 43.680
>> }
>>
>> # rules
>> rule replicated_ruleset {
>>         ruleset 0
>>         type replicated
>>         min_size 1
>>         max_size 10
>>         step take default
>>         step chooseleaf firstn 0 type chassis
>>         step emit
>> }
>>
>> # end crush map
>>
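You can also replay the host failure offline against the crushmap.bin
extracted earlier and check where CRUSH wants to put the third replica,
by zero-weighting the failed host's OSDs in test mode (the device
numbers below are only an example -- use the ones from the failed host):

    crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings \
        --weight 1 0 --weight 7 0 --weight 13 0

If the simulated mappings still pick two OSDs from one host, the map or
tunables are the problem; if they don't, the odd acting set above is
just the temporary backfill state Greg describes.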
>>
>> Thank you,
>> Laszlo



-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


