Check here for further details about completing upgrades (this started with the Luminous series):
https://ceph.io/en/news/blog/2017/new-luminous-upgrade-complete/

-Drew

On 9/1/21 8:41 PM, Chris Dunlop wrote:
Hi Gregory,

On Wed, Sep 01, 2021 at 10:56:56AM -0700, Gregory Farnum wrote:
> Why are you trying to create a new pacific monitor instead of upgrading an existing one?

The "ceph orch upgrade" failed twice at the point of upgrading the mons. The first time, the octopus mons got the "--init" argument added to their docker startup, and the docker version on Debian Buster doesn't support the "--init" and "-v /dev:/dev" args at the same time, per:

https://github.com/moby/moby/pull/37665

...and the second time it was due to never having had a cephfs on the cluster:

https://tracker.ceph.com/issues/51673

So at one point I had one mon down due to the failed upgrade, then another of the 3 originals was taken out by its host's disk filling up (due, I think, to the excessive logging at the time combined with having both docker and podman images pulled in). That left me with a single octopus mon running and no quorum, bringing the cluster to a standstill and me panic-learning how to deal with the situation. Fun times.

So yes, I was feeling just a little leery about upgrading the octopus mons and potentially losing quorum again!

> I *think* what's going on here is that since you're deploying a new pacific mon, and you're not giving it a starting monmap, it's set up to assume the use of pacific features. It can find peers at the locations you've given it, but since they're on octopus there are mismatches. Now, I would expect and want this to work so you should file a bug,

https://tracker.ceph.com/issues/52488

> but the initial bootstrapping code is a bit hairy and may not account for cross-version initial setup in this fashion, or have gotten buggy since written. So I'd try upgrading the existing mons, or generating a new pacific mon and upgrading that one to octopus if you're feeling leery.

Yes, I thought a safer / less stressful way of progressing would be to add a new octopus mon to the existing quorum and upgrade that one first as a test.
I went ahead with that and checked the cluster health immediately afterwards: "ceph -s" showed HEALTH_OK, with 4 mons, i.e. 3 x octopus and 1 x pacific. Nice!

But shortly afterwards alarms started going off, and the cluster health came back as more than a little gut-wrenching, with ALL pgs showing up as inactive / unknown:

    $ ceph -s
      cluster:
        id:     c6618970-0ce0-4cb2-bc9a-dd5f29b62e24
        health: HEALTH_WARN
                Reduced data availability: 5721 pgs inactive
                (muted: OSDMAP_FLAGS POOL_NO_REDUNDANCY)

      services:
        mon: 4 daemons, quorum k2,b2,b4,b5 (age 43m)
        mgr: b5(active, starting, since 40m), standbys: b4, b2
        osd: 78 osds: 78 up (since 4d), 78 in (since 3w)
             flags noout

      data:
        pools:   12 pools, 5721 pgs
        objects: 0 objects, 0 B
        usage:   0 B used, 0 B / 0 B avail
        pgs:     100.000% pgs unknown
                 5721 unknown

    $ ceph health detail
    HEALTH_WARN Reduced data availability: 5721 pgs inactive; (muted: OSDMAP_FLAGS POOL_NO_REDUNDANCY)
    (MUTED) [WRN] OSDMAP_FLAGS: noout flag(s) set
    [WRN] PG_AVAILABILITY: Reduced data availability: 5721 pgs inactive
        pg 6.fcd is stuck inactive for 41m, current state unknown, last acting []
        pg 6.fce is stuck inactive for 41m, current state unknown, last acting []
        pg 6.fcf is stuck inactive for 41m, current state unknown, last acting []
        pg 6.fd0 is stuck inactive for 41m, current state unknown, last acting []
        ...etc.

So that was also heaps of fun for a while, until I thought to remove the pacific mon, and the health reverted to normal. Bug filed:

https://tracker.ceph.com/issues/52489

At this point I'm more than a little gun-shy, but I'm girding my loins to go ahead with the rest of the upgrade on the basis that the health issue is "just" a temporary reporting problem (albeit a highly startling one!) with mixed octopus and pacific mons.

Cheers,

Chris
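A minimal sketch, not from the thread, of how the two symptoms described above could be watched from a script while mons are being upgraded: mixed mon releases (from "ceph versions") and a status in which every PG is "unknown" (from "ceph status"). It also prints the require_osd_release value from "ceph osd dump", which is set with "ceph osd require-osd-release <release>" once all OSDs are upgraded. Assumptions: the "ceph" CLI is on PATH with admin credentials, and the JSON keys used here ("pgmap", "num_pgs", "pgs_by_state", "require_osd_release") match your release's output; the function names are hypothetical.

```python
import json
import re
import subprocess
from collections import Counter

# Assumed version-string shape: "ceph version 16.2.5 (<sha1>) pacific (stable)".
# The capture group picks out the release codename between ") " and " (".
RELEASE_RE = re.compile(r"\)\s+(\w+)\s+\(")


def run_ceph_json(*args: str) -> dict:
    """Run a ceph CLI command with JSON output and parse it (needs admin access)."""
    out = subprocess.run(
        ["ceph", *args, "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def releases_by_daemon(versions: dict) -> dict:
    """Map each daemon type in "ceph versions" output to a Counter of release names."""
    result = {}
    for daemon, counts in versions.items():
        if daemon == "overall":  # aggregate across all daemon types; skip it
            continue
        releases = Counter()
        for version_string, n in counts.items():
            match = RELEASE_RE.search(version_string)
            releases[match.group(1) if match else "unknown"] += n
        result[daemon] = releases
    return result


def mons_mixed(versions: dict) -> bool:
    """True when the mons report more than one release (e.g. octopus + pacific)."""
    return len(releases_by_daemon(versions).get("mon", Counter())) > 1


def unknown_pg_fraction(status: dict) -> float:
    """Fraction of PGs whose state is exactly "unknown", from "ceph status" JSON."""
    pgmap = status.get("pgmap", {})
    total = pgmap.get("num_pgs", 0)
    unknown = sum(
        s["count"] for s in pgmap.get("pgs_by_state", [])
        if s["state_name"] == "unknown"
    )
    return unknown / total if total else 0.0


def check_cluster() -> None:
    """Print a short mixed-release / PG-visibility report for the live cluster."""
    versions = run_ceph_json("versions")
    for daemon, releases in releases_by_daemon(versions).items():
        print(f"{daemon}: {dict(releases)}")
    if mons_mixed(versions):
        print("WARNING: mons are running mixed releases")
    frac = unknown_pg_fraction(run_ceph_json("status"))
    print(f"unknown PGs: {frac:.1%}")
    osd_dump = run_ceph_json("osd", "dump")
    print("require_osd_release:", osd_dump.get("require_osd_release"))
```

Calling check_cluster() once right after adding the pacific mon and again some minutes later would matter here, since in the report above the cluster looked healthy at first and only later showed every PG as unknown. The parsing functions take plain dicts so they can be checked against saved JSON without touching a live cluster.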