Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

“Up” is the set of live OSDs that the calculated CRUSH mapping assigns to
the PG. “Acting” includes the extras that have been added in to bring the
PG up to its proper size. So the PG does have 3 live OSDs serving it.

But perhaps the safety check *is* looking at up instead of acting? That
seems like a plausible bug. (Also, if CRUSH is failing to map properly,
that’s not a great sign for your cluster health or design.)
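
If you want to check that theory, something along these lines should show
the up vs. acting sets for the remapped PGs and run the same ok-to-stop
check the orchestrator relies on (a sketch from memory; the PG and OSD ids
are placeholders, so substitute your own and sanity-check against the docs
for your release):

  # list the remapped PGs together with their up and acting sets
  ceph pg ls remapped
  # or query one PG and compare the two sets directly (2.1a is a placeholder id)
  ceph pg 2.1a query | grep -A 5 -E '"up"|"acting"'
  # ask the cluster whether stopping a given OSD would leave any PG inactive
  # (12 is a placeholder OSD id)
  ceph osd ok-to-stop 12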

On Thu, Feb 10, 2022 at 11:26 AM 胡 玮文 <huww98@xxxxxxxxxxx> wrote:

> I believe this is the reason.
>
> I mean the number of OSDs in the “up” set should be at least 1 greater than
> the min_size for the upgrade to proceed. Otherwise, once any OSD is stopped,
> the up set can drop below min_size, which prevents the PG from becoming
> active. So just clean up the misplaced objects and the upgrade should
> proceed automatically.
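>
> Something like this (untested, from memory) should show which PGs would
> trip that check; compare each pool's min_size with the size of the "up"
> set that pg dump reports:
>
>   # size and min_size of every pool
>   ceph osd pool ls detail
>   # up and acting sets per PG; any PG whose up set has only min_size
>   # members would go inactive if one more of its OSDs stopped
>   ceph pg dump pgs_brief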
>
> But I’m a little confused. I think if you have only 2 up OSDs in a
> replica x3 pool, it should be in a degraded state and should give you a
> HEALTH_WARN.
>
> On Feb 11, 2022, at 03:06, Zach Heise (SSCC) <heise@xxxxxxxxxxxx> wrote:
>
>
>
> Hi Weiwen, thanks for replying.
>
> All of my replicated pools, including the ssdpool I made most recently,
> have a min_size of 2. My other two EC pools have a min_size of 3.
>
> Looking at pg dump output again, it does look like the two EC pools have
> exactly 4 OSDs listed in the "Acting" column, and everything else has 3
> OSDs in Acting. So that's as it should be, I believe?
>
> I do have some 'misplaced' objects on 8 different PGs (the
> active+clean+remapped ones listed in my original ceph -s output). Those PGs
> have only 2 "up" OSDs listed, but each has 3 OSDs in the "Acting" column,
> as they should. Apparently these 231 misplaced objects aren't enough to
> cause ceph to drop out of HEALTH_OK status.
>
> Zach
>
>
> On 2022-02-10 12:41 PM, huww98@xxxxxxxxxxx wrote:
>
> Hi Zach,
>
> How about your min_size setting? Have you checked that the number of OSDs
> in the acting set of every PG is at least 1 greater than the min_size of
> the corresponding pool?
>
> Weiwen Hu
>
>
>
> On Feb 10, 2022, at 05:02, Zach Heise (SSCC) <heise@xxxxxxxxxxxx> wrote:
>
> Hello,
>
> ceph health detail says my 5-node cluster is healthy, yet when I ran ceph
> orch upgrade start --ceph-version 16.2.7 everything seemed to go fine until
> we got to the OSD section. Now, for the past hour, a new log entry of
> 'Upgrade: unsafe to stop osd(s) at this time (1 PGs are or would become
> offline)' has appeared in the logs every 15 seconds.
>
> ceph pg dump_stuck (unclean, degraded, etc.) shows "ok" for everything too.
> Yet somehow 1 PG is (apparently) holding up all the OSD upgrades and not
> letting the process finish. Should I stop the upgrade and try it again? (I
> haven't done that before, so I was just nervous to try it.) Any other ideas?
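>
> If I do stop it, I assume it would be with something like the following
> (going by the cephadm docs; please correct me if I have this wrong):
>
>   # see what the orchestrator thinks the upgrade is doing
>   ceph orch upgrade status
>   # stop the current upgrade
>   ceph orch upgrade stop
>   # and then kick it off again
>   ceph orch upgrade start --ceph-version 16.2.7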
>
>  cluster:
>    id:     9aa000e8-b999-11eb-82f2-ecf4bbcc0ac0
>    health: HEALTH_OK
>   services:
>    mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 92m)
>    mgr: ceph03.futetp(active, since 97m), standbys: ceph01.fblojp
>    mds: 1/1 daemons up, 1 hot standby
>    osd: 33 osds: 33 up (since 2h), 33 in (since 4h); 9 remapped pgs
>   data:
>    volumes: 1/1 healthy
>    pools:   7 pools, 193 pgs
>    objects: 3.72k objects, 14 GiB
>    usage:   43 GiB used, 64 TiB / 64 TiB avail
>    pgs:     231/11170 objects misplaced (2.068%)
>             185 active+clean
>             8   active+clean+remapped
>   io:
>    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
>   progress:
>    Upgrade to 16.2.7 (5m)
>      [=====.......................] (remaining: 24m)
>
> --
> Zach
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



