Re: New pacific mon won't join with octopus mons

Hi Gregory,

On Wed, Sep 01, 2021 at 10:56:56AM -0700, Gregory Farnum wrote:
> Why are you trying to create a new pacific monitor instead of
> upgrading an existing one?

The "ceph orch upgrade" failed twice at the point of upgrading the mons: once because the octopus mons had the "--init" argument added to their docker startup, and the docker version on Debian Buster doesn't support "--init" together with "-v /dev:/dev", per:

https://github.com/moby/moby/pull/37665

...and once due to never having a cephfs on the cluster:

https://tracker.ceph.com/issues/51673
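
(For the curious, the failing invocation had roughly the following shape; the image name and trailing arguments are illustrative, not the exact cephadm-generated command line. As I understand the moby PR above, older docker bind-mounts its init binary at /dev/init inside the container, which collides with the "-v /dev:/dev" mount; 18.09+ moved it out of /dev:

$ docker run --rm --init -v /dev:/dev ceph/ceph:v15 ceph-mon ...

)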

So at one point I had one mon down due to the failed upgrade. Then another of the 3 originals was taken out by the host's disk filling up (due, I think, to the excessive logging occurring at the time, combined with having both docker and podman images pulled in), leaving me with a single octopus mon running, no quorum, a cluster at a standstill, and me panic-learning how to deal with the situation. Fun times.

So yes, I was feeling just a little leery about upgrading the octopus mons and potentially losing quorum again!

> I *think* what's going on here is that since you're deploying a new
> pacific mon, and you're not giving it a starting monmap, it's set up
> to assume the use of pacific features. It can find peers at the
> locations you've given it, but since they're on octopus there are
> mismatches.

> Now, I would expect and want this to work so you should file a bug,

https://tracker.ceph.com/issues/52488

> but the initial bootstrapping code is a bit hairy and may not account
> for cross-version initial setup in this fashion, or have gotten buggy
> since written. So I'd try upgrading the existing mons, or generating a
> new octopus mon and upgrading that one to pacific if you're feeling
> leery.
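
(For reference, "giving it a starting monmap" amounts to the manual mon-add procedure, i.e. seeding the new mon's store from the running quorum's map so it inherits their feature set; the mon id and paths here are illustrative:

$ ceph auth get mon. -o /tmp/mon-keyring
$ ceph mon getmap -o /tmp/monmap
$ ceph-mon -i newmon --mkfs --monmap /tmp/monmap --keyring /tmp/mon-keyring

)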

Yes, I thought a safer / less stressful way of progressing would be to add a new octopus mon to the existing quorum and upgrade that one first as a test. I went ahead with that and checked the cluster health immediately afterwards: "ceph -s" showed HEALTH_OK, with 4 mons, i.e. 3 x octopus and 1 x pacific.
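
The sequence was roughly the following (hostname, IP and image tag illustrative, and the exact redeploy syntax varies slightly between cephadm releases):

$ ceph orch daemon add mon newhost:10.0.0.10
$ ceph orch daemon redeploy mon.newhost quay.io/ceph/ceph:v16.2.5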

Nice! But shortly afterwards alarms started going off, and the cluster health was looking more than a little gut-wrenching, with ALL pgs showing up as inactive / unknown:

$ ceph -s
  cluster:
    id:     c6618970-0ce0-4cb2-bc9a-dd5f29b62e24
    health: HEALTH_WARN
            Reduced data availability: 5721 pgs inactive
            (muted: OSDMAP_FLAGS POOL_NO_REDUNDANCY)

  services:
    mon: 4 daemons, quorum k2,b2,b4,b5 (age 43m)
    mgr: b5(active, starting, since 40m), standbys: b4, b2
    osd: 78 osds: 78 up (since 4d), 78 in (since 3w)
         flags noout

  data:
    pools:   12 pools, 5721 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             5721 unknown

$ ceph health detail
HEALTH_WARN Reduced data availability: 5721 pgs inactive; (muted: OSDMAP_FLAGS POOL_NO_REDUNDANCY)
(MUTED) [WRN] OSDMAP_FLAGS: noout flag(s) set
[WRN] PG_AVAILABILITY: Reduced data availability: 5721 pgs inactive
    pg 6.fcd is stuck inactive for 41m, current state unknown, last acting []
    pg 6.fce is stuck inactive for 41m, current state unknown, last acting []
    pg 6.fcf is stuck inactive for 41m, current state unknown, last acting []
    pg 6.fd0 is stuck inactive for 41m, current state unknown, last acting []
    ...etc.
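
While waiting that out, a quick way to alarm on this state programmatically is to parse "ceph -s --format json". A minimal sketch (the helper name is mine, and it assumes the usual "pgmap"/"pgs_by_state" layout of that output):

```python
import json

def unknown_pg_count(ceph_status_json: str) -> int:
    """Count PGs reporting state 'unknown' in `ceph -s --format json` output."""
    status = json.loads(ceph_status_json)
    return sum(
        entry["count"]
        for entry in status["pgmap"].get("pgs_by_state", [])
        if "unknown" in entry["state_name"]
    )

# Trimmed-down sample matching the incident above:
sample = json.dumps({
    "pgmap": {
        "num_pgs": 5721,
        "pgs_by_state": [{"state_name": "unknown", "count": 5721}],
    }
})
```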

So that was also heaps of fun for a while, until I thought to remove the pacific mon and the health reverted to normal. Bug filed:

https://tracker.ceph.com/issues/52489
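
(The removal itself is a one-liner under cephadm; the mon name is illustrative, and --force is needed while the daemon is still in the mon map:

$ ceph orch daemon rm mon.newmon --force

)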

At this point I'm more than a little gun-shy, but I'm girding my loins to go ahead with the rest of the upgrade on the basis that the health issue is "just" a temporary reporting problem (albeit a highly startling one!) with mixed octopus and pacific mons.
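
When I do, it'll be the standard orchestrated path again (version tag illustrative), keeping an eye on progress and ready to hit pause:

$ ceph orch upgrade start --ceph-version 16.2.5
$ ceph orch upgrade status
$ ceph orch upgrade pause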

Cheers,

Chris


