Re: Cluster healthy, but 16.2.7 osd daemon upgrade says it's unsafe to stop them?

It could be an issue with the device health pool, as you are correct that it is a single PG - but when the cluster is reporting that everything is healthy, it's difficult to know where to go from there. What I don't understand is why it's refusing to upgrade ANY of the OSD daemons; I have 33 of them, so why would a single PG going offline be a problem for all of them?

I did try stopping the upgrade and restarting it, but it just picks up at the same place (11/56 daemons upgraded) and immediately reports the same issue.

Is there any way to at least tell which PG is the problematic one?
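
(I'm also wondering whether the same safety check the orchestrator runs can be invoked by hand; a rough sketch, assuming osd.0 as an example and that the 1-PG pool is the Pacific-default device_health_metrics:

   # ask whether stopping a given OSD would take any PGs offline
   ceph osd ok-to-stop 0
   # list the PG(s) in the device health pool and the OSDs they map to
   ceph pg ls-by-pool device_health_metrics

I haven't verified whether the ok-to-stop output actually names the offending PG, though.)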

Zach


On 2022-02-09 4:19 PM, anthony.datri@xxxxxxxxx wrote:
Speculation:  might the devicehealth pool be involved?  It seems to typically have just 1 PG.
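
If so, it might be worth confirming how that pool is actually configured; a rough check, assuming the Pacific default pool name device_health_metrics:

   # size / min_size and PG count of the device health pool (name assumed)
   ceph osd pool ls detail | grep device_health
   ceph osd pool get device_health_metrics size
   ceph osd pool get device_health_metrics min_size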



On Feb 9, 2022, at 1:41 PM, Zach Heise (SSCC) <heise@xxxxxxxxxxxx> wrote:

Good afternoon, and thank you for your reply. Yes, you're right; eventually we'll switch to an odd number of mons rather than an even one. We're still in 'testing' mode right now, and only my coworkers and I are using the cluster.

Of the 7 pools, all but 2 are replica x3. The other two are EC 2+2.

Zach Heise


On 2022-02-09 3:38 PM, sascha.arthur@xxxxxxxxx wrote:
Hello,

Are all of your pools running replica > 1?
Also, having 4 monitors is pretty bad for split-brain situations.
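
A quick way to double-check, for what it's worth, is something like:

   # lists every pool with its replicated size / min_size or EC profile
   ceph osd pool ls detail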

Zach Heise (SSCC) <heise@xxxxxxxxxxxx> wrote on Wed, Feb 9, 2022, 22:02:

   Hello,

   ceph health detail says my 5-node cluster is healthy, yet when I ran
   ceph orch upgrade start --ceph-version 16.2.7 everything seemed to go
   fine until it got to the OSD section. Now, for the past hour, every 15
   seconds a new log entry of 'Upgrade: unsafe to stop osd(s) at this time
   (1 PGs are or would become offline)' appears in the logs.

   ceph pg dump_stuck (unclean, degraded, etc.) shows "ok" for everything
   too. Yet somehow 1 PG is (apparently) holding up all the OSD upgrades
   and not letting the process finish. Should I stop the upgrade and try
   it again? (I haven't done that before, so I was just nervous to try
   it.) Any other ideas?
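
   For what it's worth, the commands I'd expect to use to pause or stop
   the upgrade and kick it off again (going by the cephadm orchestrator
   commands I've seen documented, so treat this as a sketch) would be:

      # check progress and the target version
      ceph orch upgrade status
      # pause, or stop the upgrade entirely
      ceph orch upgrade pause
      ceph orch upgrade stop
      # start it again later
      ceph orch upgrade start --ceph-version 16.2.7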

      cluster:
        id:     9aa000e8-b999-11eb-82f2-ecf4bbcc0ac0
        health: HEALTH_OK

      services:
        mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 92m)
        mgr: ceph03.futetp(active, since 97m), standbys: ceph01.fblojp
        mds: 1/1 daemons up, 1 hot standby
        osd: 33 osds: 33 up (since 2h), 33 in (since 4h); 9 remapped pgs

      data:
        volumes: 1/1 healthy
        pools:   7 pools, 193 pgs
        objects: 3.72k objects, 14 GiB
        usage:   43 GiB used, 64 TiB / 64 TiB avail
        pgs:     231/11170 objects misplaced (2.068%)
                 185 active+clean
                 8   active+clean+remapped

      io:
        client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

      progress:
        Upgrade to 16.2.7 (5m)
          [=====.......................] (remaining: 24m)

   --
   Zach
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
