Re: PG mapped to OSDs on same host although 'chooseleaf type host'

On 02/23/2018 12:42 AM, Mike Lovell wrote:
was the pg-upmap feature used to force a pg to get mapped to a particular osd?


Yes it was. This is a semi-production cluster where the balancer module has been enabled with the upmap feature.

It seems it remapped PGs to OSDs on the same host.

root@man:~# ceph osd dump|grep pg_upmap|grep 1.41
pg_upmap_items 1.41 [9,15,11,7,10,2]
root@man:~#

I'm not sure exactly what to extract from that output, but it does seem to be the case here.
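
If I understand the format correctly, those are [from,to] pairs, so this entry remapped osd.9 to osd.15, osd.11 to osd.7 and osd.10 to osd.2 for PG 1.41. A rough one-liner to print all entries as explicit pairs (field positions assumed from the osd dump output above):

$ ceph osd dump|grep pg_upmap_items|awk '{gsub(/[][]/,"",$3); n=split($3,a,","); printf "%s:",$2; for(i=1;i<n;i+=2) printf " %s->%s",a[i],a[i+1]; print ""}'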

I removed the upmap entry for this PG, which fixed it:

$ ceph osd rm-pg-upmap-items 1.41

I also disabled the balancer for now (I will report an issue) and removed all other upmap entries:

$ ceph osd dump|grep pg_upmap_items|awk '{print $2}'|xargs -n 1 ceph osd rm-pg-upmap-items
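
Turning the balancer off should just be the following (assuming the default mgr balancer module is what's enabled here):

$ ceph balancer off

After that, 'ceph osd dump|grep pg_upmap_items' coming back empty confirms all entries are gone.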

Thanks for the hint!

Wido

mike

On Thu, Feb 22, 2018 at 10:28 AM, Wido den Hollander <wido@xxxxxxxx> wrote:

    Hi,

    I have a situation with a cluster which was recently upgraded to
    Luminous and has a PG mapped to OSDs on the same host.

    root@man:~# ceph pg map 1.41
    osdmap e21543 pg 1.41 (1.41) -> up [15,7,4] acting [15,7,4]
    root@man:~#

    root@man:~# ceph osd find 15|jq -r '.crush_location.host'
    n02
    root@man:~# ceph osd find 7|jq -r '.crush_location.host'
    n01
    root@man:~# ceph osd find 4|jq -r '.crush_location.host'
    n02
    root@man:~#

    As you can see, osd.15 and osd.4 are both on host 'n02'.
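
    The same check can be scripted for any PG, along these lines (the
    'up' field in the pg map JSON output is assumed here):

    root@man:~# for osd in $(ceph pg map 1.41 -f json|jq -r '.up[]'); do echo -n "osd.$osd: "; ceph osd find $osd|jq -r '.crush_location.host'; done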

    This PG went inactive when the machine hosting both OSDs went down
    for maintenance.

    My first suspect was the CRUSH map and the rules, but those are fine:

    rule replicated_ruleset {
             id 0
             type replicated
             min_size 1
             max_size 10
             step take default
             step chooseleaf firstn 0 type host
             step emit
    }

    This is the only rule in the CRUSH map.
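
    For reference, the same rule can also be inspected on the running
    cluster without decompiling the map:

    root@man:~# ceph osd crush rule dump replicated_ruleset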

    ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
    -1       19.50325 root default
    -2        2.78618     host n01
      5   ssd  0.92999         osd.5      up  1.00000 1.00000
      7   ssd  0.92619         osd.7      up  1.00000 1.00000
    14   ssd  0.92999         osd.14     up  1.00000 1.00000
    -3        2.78618     host n02
      4   ssd  0.92999         osd.4      up  1.00000 1.00000
      8   ssd  0.92619         osd.8      up  1.00000 1.00000
    15   ssd  0.92999         osd.15     up  1.00000 1.00000
    -4        2.78618     host n03
      3   ssd  0.92999         osd.3      up  0.94577 1.00000
      9   ssd  0.92619         osd.9      up  0.82001 1.00000
    16   ssd  0.92999         osd.16     up  0.84885 1.00000
    -5        2.78618     host n04
      2   ssd  0.92999         osd.2      up  0.93501 1.00000
    10   ssd  0.92619         osd.10     up  0.76031 1.00000
    17   ssd  0.92999         osd.17     up  0.82883 1.00000
    -6        2.78618     host n05
      6   ssd  0.92999         osd.6      up  0.84470 1.00000
    11   ssd  0.92619         osd.11     up  0.80530 1.00000
    18   ssd  0.92999         osd.18     up  0.86501 1.00000
    -7        2.78618     host n06
      1   ssd  0.92999         osd.1      up  0.88353 1.00000
    12   ssd  0.92619         osd.12     up  0.79602 1.00000
    19   ssd  0.92999         osd.19     up  0.83171 1.00000
    -8        2.78618     host n07
      0   ssd  0.92999         osd.0      up  1.00000 1.00000
    13   ssd  0.92619         osd.13     up  0.86043 1.00000
    20   ssd  0.92999         osd.20     up  0.77153 1.00000

    Here you see osd.15 and osd.4 on the same host 'n02'.

    This cluster was upgraded from Hammer to Jewel and now to Luminous,
    and it doesn't have the latest tunables yet. Should that matter? I
    have never encountered this before.

    tunable choose_local_tries 0
    tunable choose_local_fallback_tries 0
    tunable choose_total_tries 50
    tunable chooseleaf_descend_once 1
    tunable chooseleaf_vary_r 1
    tunable chooseleaf_stable 1
    tunable straw_calc_version 1
    tunable allowed_bucket_algs 54
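
    Those values are from the decompiled map; the running cluster should
    report the same settings with:

    root@man:~# ceph osd crush show-tunables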

    I don't want to touch this yet in case this is a bug or a glitch
    in the matrix somewhere.

    I hope it's just an admin mistake, but so far I haven't been able
    to find a clue pointing to that.

    root@man:~# ceph osd dump|head -n 12
    epoch 21545
    fsid 0b6fb388-6233-4eeb-a55c-476ed12bdf0a
    created 2015-04-28 14:43:53.950159
    modified 2018-02-22 17:56:42.497849
    flags sortbitwise,recovery_deletes,purged_snapdirs
    crush_version 22
    full_ratio 0.95
    backfillfull_ratio 0.9
    nearfull_ratio 0.85
    require_min_compat_client luminous
    min_compat_client luminous
    require_osd_release luminous
    root@man:~#

    I also downloaded the CRUSH map and ran crushtool with --test and
    --show-mappings, but that didn't show any PG mapped to the same host.
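
    The invocation was along these lines (filename and exact options
    approximated from memory):

    root@man:~# ceph osd getcrushmap -o crushmap.bin
    root@man:~# crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings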

    Any ideas on what might be going on here?

    Wido
    _______________________________________________
    ceph-users mailing list
    ceph-users@xxxxxxxxxxxxxx
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



