Re: PG stuck in active+clean+remapped

This is just a follow-up for those who run into a similar problem.

Originally this was a pool with only 4 nodes, size 3, min_size 2, and a big difference in node and OSD weights (node weights 10, 2, 4, 4; OSD weights from 2.5 down to 0.5). The detailed CRUSH map is below [1]; it shows only the 3 nodes that were left, and the issue persisted at that point.
When we excluded one of the smaller nodes from the pool, the issue appeared.

It turned out that the new mapping of [26,14,9] tried to put the PG on the same node twice, which conflicts with the CRUSH rule for the pool [2]. osd.26 and osd.9 reside on the same node, and the rule requires each PG copy to be placed on a separate node.
For some reason the cluster was not able to do that, though it had the required number of nodes.
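
The conflict can be checked by hand against the CRUSH map in [1]. Below is a minimal Python sketch of that check, with the OSD-to-host table copied from the map. The helper name hosts_used_more_than_once is only for illustration; it is not a Ceph API.

```python
from collections import Counter

# OSD id -> host, copied from the CRUSH map in [1]
OSD_HOST = {
    19: "backup1", 35: "backup1", 13: "backup1", 14: "backup1",
    33: "backup2", 36: "backup2", 12: "backup2", 34: "backup2",
    29: "backup3", 22: "backup3", 28: "backup3", 24: "backup3",
    26: "backup3", 20: "backup3", 9: "backup3", 21: "backup3",
}

def hosts_used_more_than_once(osds):
    """Return the hosts that hold more than one OSD of the given set."""
    counts = Counter(OSD_HOST[o] for o in osds)
    return sorted(h for h, n in counts.items() if n > 1)

print(hosts_used_more_than_once([26, 14, 9]))  # ACTING set from the pg dump
print(hosts_used_more_than_once([26, 14]))     # UP set from the pg dump
```

For the ACTING set [26,14,9] this reports backup3, since osd.26 and osd.9 both live there. For the UP set [26,14] it reports nothing, but that set has only two of the three required copies.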

Anyway, I googled a similar issue[3], and it mentioned that a weight difference can be a problem.
So we took one OSD out of the fat node, the new mapping worked fine, and the issue disappeared.
I guess the CRUSH algorithm can't handle some extreme weight differences, which is to be expected(?).
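
To put rough numbers on the weight difference, here is a small Python sketch using the host weights from [1]. It rests on a loud simplification: each CRUSH retry is treated as an independent random draw with probability proportional to weight, which is not how the real (deterministic, hash-based) retry logic works. The retry count of 50 is the choose_total_tries value from the tunables quoted further down in this thread.

```python
# Host weights from the CRUSH map in [1]
HOST_WEIGHT = {"backup1": 10.920, "backup2": 2.544, "backup3": 4.361}
TOTAL = sum(HOST_WEIGHT.values())

def share(host):
    """Fraction of the root's total weight held by one host."""
    return HOST_WEIGHT[host] / TOTAL

def miss_probability(host, tries):
    """Chance that `tries` independent weighted draws never pick `host`.

    Simplified model only: real CRUSH retries are deterministic hashes,
    not independent random draws.
    """
    return (1.0 - share(host)) ** tries

for h in HOST_WEIGHT:
    print(f"{h}: {share(h):.1%} of total weight")
print(f"P(backup2 never picked in 50 tries) ~ {miss_probability('backup2', 50):.2e}")
```

With size 3 and only three hosts, every PG needs one copy on each host, including the lightest one. Under this model the chance of missing the lightest host for a single PG is small, but across a pool with many PGs it could plausibly show up as one stuck PG. This is a guess at the mechanism, not a confirmed diagnosis.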

[1]
host backup1 {
        id -19          # do not change unnecessarily
        id -41 class hdd                # do not change unnecessarily
        id -31 class ssd                # do not change unnecessarily
        # weight 10.920
        alg straw2
        hash 0  # rjenkins1
        item osd.19 weight 2.730
        item osd.35 weight 2.730
        item osd.13 weight 2.730
        item osd.14 weight 2.730
}
host backup2 {
        id -20          # do not change unnecessarily
        id -42 class hdd                # do not change unnecessarily
        id -32 class ssd                # do not change unnecessarily
        # weight 2.544
        alg straw2
        hash 0  # rjenkins1
        item osd.33 weight 0.545
        item osd.36 weight 0.545
        item osd.12 weight 0.545
        item osd.34 weight 0.909
}
host backup3 {
        id -22          # do not change unnecessarily
        id -43 class hdd                # do not change unnecessarily
        id -36 class ssd                # do not change unnecessarily
        # weight 4.361
        alg straw2
        hash 0  # rjenkins1
        item osd.29 weight 0.545
        item osd.22 weight 0.545
        item osd.28 weight 0.545
        item osd.24 weight 0.545
        item osd.26 weight 0.545
        item osd.20 weight 0.546
        item osd.9 weight 0.545
        item osd.21 weight 0.545
}
root backups {
        id -21          # do not change unnecessarily
        id -30 class hdd                # do not change unnecessarily
        id -40 class ssd                # do not change unnecessarily
        # weight 17.825
        alg straw2
        hash 0  # rjenkins1
        item backup1 weight 10.920
        item backup2 weight 2.544
        item backup3 weight 4.361
}

[2]
rule backups-rule {
        id 3
        type replicated
        min_size 1
        max_size 10
        step take backups
        step chooseleaf firstn 0 type host
        step emit
}


Mon, 1 Apr 2019 at 12:23, Vladimir Prokofev <v@xxxxxxxxxxx>:
As we fixed the failed node the next day, the cluster rebalanced to its original state without any issues, so a crush dump would be irrelevant at this point, I guess. We will have to wait for the next occurrence.
Here's the tunables part, maybe it will help shed some light:

    "tunables": {
        "choose_local_tries": 0,
        "choose_local_fallback_tries": 0,
        "choose_total_tries": 50,
        "chooseleaf_descend_once": 1,
        "chooseleaf_vary_r": 1,
        "chooseleaf_stable": 0,
        "straw_calc_version": 1,
        "allowed_bucket_algs": 22,
        "profile": "firefly",
        "optimal_tunables": 0,
        "legacy_tunables": 0,
        "minimum_required_version": "firefly",
        "require_feature_tunables": 1,
        "require_feature_tunables2": 1,
        "has_v2_rules": 0,
        "require_feature_tunables3": 1,
        "has_v3_rules": 0,
        "has_v4_buckets": 0,
        "require_feature_tunables5": 0,
        "has_v5_rules": 0
    },
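
For anyone comparing their own cluster, here is a small Python sketch that pulls the placement-related values out of the JSON printed by 'ceph osd crush dump', which has a "tunables" key as shown above. The helper name tunables_summary and the choice of keys are assumptions for illustration. The sample data is a subset of the values quoted above.

```python
import json

# Keys that seem most relevant to replica placement retries (illustrative choice)
KEYS = ("profile", "choose_total_tries", "chooseleaf_vary_r", "chooseleaf_stable")

def tunables_summary(crush_dump):
    """Pick a few tunables out of a parsed 'ceph osd crush dump' document."""
    tunables = crush_dump["tunables"]
    return {key: tunables[key] for key in KEYS}

# Subset of the tunables quoted above, embedded so the sketch runs standalone
sample = json.loads("""
{
  "tunables": {
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 1,
    "chooseleaf_stable": 0,
    "profile": "firefly"
  }
}
""")

for key, value in tunables_summary(sample).items():
    print(f"{key}: {value}")
```

To use it on a real cluster, save the command output to a file and load it with json.load instead of the embedded sample.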

Sun, 31 Mar 2019 at 13:28, huang jun <hjwsm1989@xxxxxxxxx>:
It seems like CRUSH cannot get enough OSDs for this PG.
What is the output of 'ceph osd crush dump', and especially the values in the 'tunables'
section?

Vladimir Prokofev <v@xxxxxxxxxxx> wrote on Wed, 27 Mar 2019 at 4:02 AM:
>
> CEPH 12.2.11, pool size 3, min_size 2.
>
> One node went down today (the private network interface started flapping, and after a while the OSD processes crashed). No big deal, the cluster recovered, but not completely: 1 PG is stuck in the active+clean+remapped state.
>
> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES       LOG  DISK_LOG STATE                 STATE_STAMP                VERSION         REPORTED        UP         UP_PRIMARY ACTING     ACTING_PRIMARY LAST_SCRUB      SCRUB_STAMP                LAST_DEEP_SCRUB DEEP_SCRUB_STAMP           SNAPTRIMQ_LEN
> 20.a2       511                  0        0       511       0  1584410172 1500     1500 active+clean+remapped 2019-03-26 20:50:18.639452    96149'189204    96861:935872    [26,14]         26  [26,14,9]             26    96149'189204 2019-03-26 10:47:36.174769    95989'187669 2019-03-22 23:29:02.322848             0
>
> It states the PG is placed on OSDs 26,14, but it should be on 26,14,9. As far as I can see there's nothing wrong with any of those OSDs: they work, host other PGs, peer with each other, etc. I tried restarting all of them one after another, without any success.
> OSD 9 hosts 95 other PGs, so I don't think it's PG overdose.
>
> Last line of log from osd.9 mentioning PG 20.a2:
> 2019-03-26 20:50:16.294500 7fe27963a700  1 osd.9 pg_epoch: 96860 pg[20.a2( v 96149'189204 (95989'187645,96149'189204] local-lis/les=96857/96858 n=511 ec=39164/39164 lis/c 96857/96855 les/c/f 96858/96856/66611 96859/96860/96855) [26,14]/[26,14,9] r=2 lpr=96860 pi=[96855,96860)/1 crt=96149'189204 lcod 0'0 remapped NOTIFY mbc={}] state<Start>: transitioning to Stray
>
> Nothing else out of the ordinary, just the usual scrub/deep-scrub notifications.
> Any ideas what it can be, or any other steps to troubleshoot this?
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Thank you!
HuangJun