Re: Upgrade 16.2.6 -> 16.2.7 - MON assertion failure

All done. I restarted one MON after removing the sanity option just to be sure, and it's fine.
Thanks again for your help.
Chris

On 09/12/2021 18:38, Dan van der Ster wrote:
Hi,

Good to know, thanks.

Yes, you need to restart a daemon to undo a change applied via ceph.conf.
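
On a non-cephadm, systemd-managed deployment like yours that restart would
presumably be something like this (assuming the mon id is the short hostname,
as it is in typical manual deployments):

systemctl restart ceph-mon@$(hostname -s)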

You can check exactly which config values are currently in use, and where
each setting comes from, by running this directly on the mon host:

ceph daemon mon.`hostname -s` config diff

The mons which had the setting applied via `ceph config set ...` probably
don't need to be restarted. Check what they're actually using via config
diff.
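
Once the upgrade is done, a value that was applied with `ceph config set`
can be checked and removed centrally, e.g. (a sketch, assuming the option
was set for all mons):

ceph config get mon mon_mds_skip_sanity
ceph config rm mon mon_mds_skip_sanity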


-- dan


On Thu, Dec 9, 2021 at 7:32 PM Chris Palmer <chris.palmer@xxxxxxxxx> wrote:
Hi

Yes, using ceph config is working fine for the rest of the nodes.

Do you know if it is necessary/advisable to restart the MONs after
removing the mon_mds_skip_sanity setting when the upgrade is complete?

Thanks, Chris

On 09/12/2021 17:51, Dan van der Ster wrote:
Hi,

On Thu, Dec 9, 2021 at 6:44 PM Chris Palmer <chris.palmer@xxxxxxxxx> wrote:
Hi Dan & Patrick

Setting that to true using "ceph config" didn't seem to work. I then
deleted it from there and set it in ceph.conf on node1, and eventually
after a reboot it started OK. I don't know for sure whether the failure
via ceph config was real or just a symptom of something else.

I'll do the same (using ceph.conf) on the other nodes now.
Indeed, for a mon that is already asserting, you have confirmed that
it needs to be set in ceph.conf (otherwise it asserts before reading
the config map).

The other approach -- ceph config set mon ... -- should still work in
general, provided it is done before the upgrade begins.

You can see how cephadm does this here:
https://github.com/ceph/ceph/commit/753fd2fb32196d17e186152e7deaef1e0558b781
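
In plain CLI terms it amounts to roughly the following (a sketch of the
same idea, not the exact cephadm code):

ceph config set mon mon_mds_skip_sanity true   # before the first mon is upgraded
ceph config rm mon mon_mds_skip_sanity         # once all mons are on 16.2.7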

Btw, I can't actually see any release notes other than the highlights in
the earlier posting (and 16.2.7 doesn't show up on the web site list of
releases yet). Is there anything else that I would need to know?
The Release Notes PR is here: https://github.com/ceph/ceph/pull/44131
See my comment at the bottom.

Thanks for catching this!

Cheers, Dan


Thanks for your very fast responses!
Chris

On 09/12/2021 17:10, Dan van der Ster wrote:
On Thu, Dec 9, 2021 at 5:40 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
Hi Chris,

On Thu, Dec 9, 2021 at 10:40 AM Chris Palmer <chris.palmer@xxxxxxxxx> wrote:
Hi

I've just started an upgrade of a test cluster from 16.2.6 -> 16.2.7 and
immediately hit a problem.

The cluster started as Octopus and has been upgraded through to 16.2.6
without any trouble. It is a conventional deployment on Debian 10, NOT
using cephadm. All was clean before the upgrade. It contains nodes as
follows:
- Node 1: MON, MGR, MDS, RGW
- Node 2: MON, MGR, MDS, RGW
- Node 3: MON
- Node 4-6: OSDs

In the absence of any specific upgrade instructions for 16.2.7, I
upgraded Node 1 and rebooted. The MON on that host will now not start,
throwing the following assertion:

2021-12-09T14:56:40.098+00:00 xxxxtstmon01 ceph-mon[960]: /build/ceph-16.2.7/src/mds/FSMap.cc: In function 'void FSMap::sanity(bool) const' thread 7f2d309085c0 time 2021-12-09T14:56:40.098395+0000
2021-12-09T14:56:40.098+00:00 xxxxtstmon01 ceph-mon[960]: /build/ceph-16.2.7/src/mds/FSMap.cc: 868: FAILED ceph_assert(info.compat.writeable(fs->mds_map.compat))
2021-12-09T14:56:40.103+00:00 xxxxtstmon01 ceph-mon[960]:  ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
2021-12-09T14:56:40.103+00:00 xxxxtstmon01 ceph-mon[960]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x7f2d3222423c]
2021-12-09T14:56:40.103+00:00 xxxxtstmon01 ceph-mon[960]:  2: /usr/lib/ceph/libceph-common.so.2(+0x277414) [0x7f2d32224414]
2021-12-09T14:56:40.103+00:00 xxxxtstmon01 ceph-mon[960]:  3: (FSMap::sanity(bool) const+0x2a8) [0x7f2d327331c8]
2021-12-09T14:56:40.103+00:00 xxxxtstmon01 ceph-mon[960]:  4: (MDSMonitor::update_from_paxos(bool*)+0x396) [0x55a32fe6b546]
2021-12-09T14:56:40.103+00:00 xxxxtstmon01 ceph-mon[960]:  5: (PaxosService::refresh(bool*)+0x10a) [0x55a32fd960ca]
2021-12-09T14:56:40.103+00:00 xxxxtstmon01 ceph-mon[960]:  6: (Monitor::refresh_from_paxos(bool*)+0x17c) [0x55a32fc54bec]
2021-12-09T14:56:40.103+00:00 xxxxtstmon01 ceph-mon[960]:  7: (Monitor::init_paxos()+0xfc) [0x55a32fc54e9c]
2021-12-09T14:56:40.103+00:00 xxxxtstmon01 ceph-mon[960]:  8: (Monitor::preinit()+0xbb9) [0x55a32fc7eb09]
2021-12-09T14:56:40.103+00:00 xxxxtstmon01 ceph-mon[960]:  9: main()
2021-12-09T14:56:40.103+00:00 xxxxtstmon01 ceph-mon[960]:  10: __libc_start_main()
2021-12-09T14:56:40.103+00:00 xxxxtstmon01 ceph-mon[960]:  11: _start()

ceph health detail merely shows mon01 as down, plus the 5 crashes recorded before the service stopped auto-restarting.
Please disable mon_mds_skip_sanity in the mons' ceph.conf:

[mon]
       mon_mds_skip_sanity = false
Oops, I think you meant   mon_mds_skip_sanity = true
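
i.e. presumably:

[mon]
       mon_mds_skip_sanity = true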

Chris does that allow that mon to startup?

-- dan



The cephadm upgrade sequence already does this, but I forgot (sorry!)
to mention in the release notes that it is required for manual
upgrades.

Please re-enable the sanity check after the upgrade completes and the cluster is stable.

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



