I don’t know how to get better errors out of cephadm, but the only way I can
think of for this to happen is if your CRUSH rule is somehow placing multiple
replicas of a PG on a single host that cephadm wants to upgrade. So check
your rules, your pool sizes, and your OSD tree?

-Greg

On Thu, Feb 10, 2022 at 8:25 AM Zach Heise (SSCC) <heise@xxxxxxxxxxxx> wrote:
> It could be an issue with the devicehealth pool, as you are correct that it
> is a single PG - but when the cluster is reporting that everything is
> healthy, it's difficult to know where to go from there. What I don't
> understand is why it's refusing to upgrade ANY of the OSD daemons; I have
> 33 of them, so why would a single PG going offline be a problem for all of
> them?
>
> I did try stopping the upgrade and restarting it, but it just picks up at
> the same place (11/56 daemons upgraded) and immediately reports the same
> issue.
>
> Is there any way to at least tell which PG is the problematic one?
>
> Zach
>
> On 2022-02-09 4:19 PM, anthony.datri@xxxxxxxxx wrote:
> > Speculation: might the devicehealth pool be involved? It seems to
> > typically have just 1 PG.
> >
> > On Feb 9, 2022, at 1:41 PM, Zach Heise (SSCC) <heise@xxxxxxxxxxxx> wrote:
> > > Good afternoon, thank you for your reply. Yes, I know you are right;
> > > eventually we'll switch to an odd number of mons rather than an even
> > > number. We're still in 'testing' mode right now and only my coworkers
> > > and I are using the cluster.
> > >
> > > Of the 7 pools, all but 2 are replica x3. The last two are EC 2+2.
> > >
> > > Zach Heise
> > >
> > > On 2022-02-09 3:38 PM, sascha.arthur@xxxxxxxxx wrote:
> > > > Hello,
> > > >
> > > > Are all your pools running replica > 1?
> > > > Also, having 4 monitors is pretty bad for split-brain situations.
> > > >
> > > > Zach Heise (SSCC) <heise@xxxxxxxxxxxx> wrote on Wed, 9 Feb 2022 at
> > > > 22:02:
> > > > > Hello,
> > > > >
> > > > > ceph health detail says my 5-node cluster is healthy, yet when I
> > > > > ran ceph orch upgrade start --ceph-version 16.2.7 everything seemed
> > > > > to go fine until we got to the OSD section; now, for the past hour,
> > > > > every 15 seconds a new log entry of 'Upgrade: unsafe to stop osd(s)
> > > > > at this time (1 PGs are or would become offline)' appears in the
> > > > > logs.
> > > > >
> > > > > ceph pg dump_stuck (unclean, degraded, etc.) shows "ok" for
> > > > > everything too. Yet somehow 1 PG is (apparently) holding up all the
> > > > > OSD upgrades and not letting the process finish. Should I stop the
> > > > > upgrade and try it again? (I haven't done that before, so I was
> > > > > just nervous to try it.) Any other ideas?
> > > > >
> > > > >   cluster:
> > > > >     id:     9aa000e8-b999-11eb-82f2-ecf4bbcc0ac0
> > > > >     health: HEALTH_OK
> > > > >
> > > > >   services:
> > > > >     mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 92m)
> > > > >     mgr: ceph03.futetp(active, since 97m), standbys: ceph01.fblojp
> > > > >     mds: 1/1 daemons up, 1 hot standby
> > > > >     osd: 33 osds: 33 up (since 2h), 33 in (since 4h); 9 remapped pgs
> > > > >
> > > > >   data:
> > > > >     volumes: 1/1 healthy
> > > > >     pools:   7 pools, 193 pgs
> > > > >     objects: 3.72k objects, 14 GiB
> > > > >     usage:   43 GiB used, 64 TiB / 64 TiB avail
> > > > >     pgs:     231/11170 objects misplaced (2.068%)
> > > > >              185 active+clean
> > > > >              8   active+clean+remapped
> > > > >
> > > > >   io:
> > > > >     client: 1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
> > > > >
> > > > >   progress:
> > > > >     Upgrade to 16.2.7 (5m)
> > > > >       [=====.......................] (remaining: 24m)
> > > > >
> > > > > -- Zach
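
In case it helps narrow things down: as far as I know, that 'Upgrade: unsafe
to stop osd(s) at this time' message is cephadm relaying the result of the
ok-to-stop safety check, so you can ask the same question by hand and look at
the one-PG pool directly. A rough sketch, untested here - the pool name
device_health_metrics is only the usual Pacific default created by the mgr
devicehealth module, so substitute whatever your cluster actually shows, and
<pgid> / <osd-id> are placeholders to fill in:

    # list every pool with its size, min_size, crush_rule and pg_num;
    # a pool with a single PG will stand out
    ceph osd pool ls detail

    # show that pool's PG(s), their state, and their up/acting OSD sets
    ceph pg ls-by-pool device_health_metrics

    # map a specific PG to its OSDs, then see which host each OSD lives on
    ceph pg map <pgid>
    ceph osd find <osd-id>

    # ask the same safety question cephadm asks before restarting an OSD
    ceph osd ok-to-stop <osd-id>

    # confirm the CRUSH rules separate replicas at the host level
    # (look for "type": "host" in the choose/chooseleaf steps) and that
    # the OSD tree groups the OSDs under distinct host buckets
    ceph osd crush rule dump
    ceph osd tree

If ok-to-stop objects for a particular OSD, mapping the pool's single PG and
running osd find on each OSD in its acting set should show whether more than
one replica ends up on the host cephadm is trying to restart, which is
exactly the rules / pool sizes / OSD tree check suggested above.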
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx