Re: HELP! Upgrading monitors from 14.2.22 to 16.2.7 immediately crashes in FSMap::decode()

Tyler Stachecki <stachecki.tyler@xxxxxxxxx> · Sun, 20 Mar 2022 22:40:19 -0400

What does 'ceph mon dump | grep min_mon_release' say?  You're running
msgrv2 and all Ceph daemons are talking on v2, since you're on
Nautilus, right?
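
For reference, here is a minimal sketch of pulling that value out in a script rather than eyeballing the grep. The helper name min_mon_release is made up for illustration, and it assumes the dump prints a line of the form "min_mon_release <number> (<name>)", which may differ between releases:

```shell
# Hypothetical helper (not part of Ceph): reads `ceph mon dump` output on
# stdin and prints the numeric value of the min_mon_release field.
# Assumes the field name is the first whitespace-separated token on its line.
min_mon_release() {
  awk '$1 == "min_mon_release" { print $2 }'
}

# Usage on a live cluster (left commented out so the snippet has no side effects):
#   ceph mon dump 2>/dev/null | min_mon_release
```

If the function prints nothing, the field was not found in the form assumed here, and the raw 'ceph mon dump' output is the thing to post.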

Was the cluster conceived on Nautilus, or something earlier?

Tyler

On Sun, Mar 20, 2022 at 10:30 PM Clippinger, Sam
<Sam.Clippinger@xxxxxxxxxx> wrote:
>
> Hello!
>
> I need some help.  I'm trying to upgrade a Ceph cluster from Nautilus 14.2.22 to Pacific (manually, not using cephadm).  I've only tried upgrading one monitor so far and I've hit several snags.  I've tried to troubleshoot the issue without losing the cluster (of course it's a production cluster; the test cluster upgraded just fine).
>
> This cluster has 3 monitor/manager VMs with 4 CPUs and 16 GB RAM, running CentOS 7.  It has 5 storage servers with 48 CPUs and 196 GB RAM, running Rocky Linux 8.  All of the Ceph daemons run in Docker containers built from Rocky Linux 8; the Ceph binaries are installed from the RPMs on download.ceph.com.  This cluster was originally installed with Hammer (IIRC) and upgraded through a number of versions (messenger v2 is enabled).  This cluster is only used for OpenStack RBD volumes, not CephFS or S3.
>
> Upgrading a monitor to Octopus 15.2.16 works fine: it starts up and rejoins the quorum.  When I upgrade to Pacific 16.2.5 or 16.2.7, it immediately crashes.  Upgrading to Pacific directly from Nautilus does the same thing.  Adding "mon_mds_skip_sanity = true" to ceph.conf doesn't change anything.  I've tried compacting and rebuilding the monitor store, but it doesn't help.  I can add new Nautilus 14.2.22 monitors to the cluster and they start and join in a few seconds, but upgrading them also crashes immediately.  I can post the entire crash output if it would help, but I think these are the relevant lines from 16.2.5:
> --------------------------------------------------------------------------------
> 2022-03-19T14:05:36.549-0500 7ffb78025700  0 starting mon.olaxps-ceph90 rank 3 at public addrs [v2:10.5.240.81:3300/0,v1:10.5.240.81:6789/0] at bind addrs [v2:10.5.240.81:3300/0,v1:10.5.240.81:6789/0] mon_data /var/lib/ceph/mon/ceph-olaxps-ceph90 fsid a7fcde57-88df-4f14-a290-d170f0bedb25
> 2022-03-19T14:05:36.550-0500 7ffb78025700  1 mon.olaxps-ceph90@-1(???) e24 preinit fsid a7fcde57-88df-4f14-a290-d170f0bedb25
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/FSMap.cc: In function 'void FSMap::decode(ceph::buffer::v15_2_0::list::const_iterator&)' thread 7ffb78025700 time 2022-03-19T14:05:36.552097-0500
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/FSMap.cc: 648: ceph_abort_msg("abort() called")
> ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
> 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7ffb6f1b3264]
> 2: (FSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0xc73) [0x7ffb6f6fa003]
> 3: (MDSMonitor::update_from_paxos(bool*)+0x18a) [0x563c5606697a]
> 4: (PaxosService::refresh(bool*)+0x10e) [0x563c55f87c7e]
> 5: (Monitor::refresh_from_paxos(bool*)+0x18c) [0x563c55e39eac]
> 6: (Monitor::init_paxos()+0x10c) [0x563c55e3a1bc]
> 7: (Monitor::preinit()+0xd30) [0x563c55e67660]
> 8: main()
> 9: __libc_start_main()
> 10: _start()
> 2022-03-19T14:05:36.551-0500 7ffb78025700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/FSMap.cc: In function 'void FSMap::decode(ceph::buffer::v15_2_0::list::const_iterator&)' thread 7ffb78025700 time 2022-03-19T14:05:36.552097-0500
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/FSMap.cc: 648: ceph_abort_msg("abort() called")
>
> ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
> 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7ffb6f1b3264]
> 2: (FSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0xc73) [0x7ffb6f6fa003]
> 3: (MDSMonitor::update_from_paxos(bool*)+0x18a) [0x563c5606697a]
> 4: (PaxosService::refresh(bool*)+0x10e) [0x563c55f87c7e]
> 5: (Monitor::refresh_from_paxos(bool*)+0x18c) [0x563c55e39eac]
> 6: (Monitor::init_paxos()+0x10c) [0x563c55e3a1bc]
> 7: (Monitor::preinit()+0xd30) [0x563c55e67660]
> 8: main()
> 9: __libc_start_main()
> 10: _start()
>
> *** Caught signal (Aborted) **
> in thread 7ffb78025700 thread_name:ceph-mon
> ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
> 1: /lib64/libpthread.so.0(+0x12c20) [0x7ffb6cca9c20]
> 2: gsignal()
> 3: abort()
> 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x7ffb6f1b3335]
> 5: (FSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0xc73) [0x7ffb6f6fa003]
> 6: (MDSMonitor::update_from_paxos(bool*)+0x18a) [0x563c5606697a]
> 7: (PaxosService::refresh(bool*)+0x10e) [0x563c55f87c7e]
> 8: (Monitor::refresh_from_paxos(bool*)+0x18c) [0x563c55e39eac]
> 9: (Monitor::init_paxos()+0x10c) [0x563c55e3a1bc]
> 10: (Monitor::preinit()+0xd30) [0x563c55e67660]
> 11: main()
> 12: __libc_start_main()
> 13: _start()
> 2022-03-19T14:05:36.553-0500 7ffb78025700 -1 *** Caught signal (Aborted) **
> in thread 7ffb78025700 thread_name:ceph-mon
>
> ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
> 1: /lib64/libpthread.so.0(+0x12c20) [0x7ffb6cca9c20]
> 2: gsignal()
> 3: abort()
> 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x7ffb6f1b3335]
> 5: (FSMap::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0xc73) [0x7ffb6f6fa003]
> 6: (MDSMonitor::update_from_paxos(bool*)+0x18a) [0x563c5606697a]
> 7: (PaxosService::refresh(bool*)+0x10e) [0x563c55f87c7e]
> 8: (Monitor::refresh_from_paxos(bool*)+0x18c) [0x563c55e39eac]
> 9: (Monitor::init_paxos()+0x10c) [0x563c55e3a1bc]
> 10: (Monitor::preinit()+0xd30) [0x563c55e67660]
> 11: main()
> 12: __libc_start_main()
> 13: _start()
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> --------------------------------------------------------------------------------
>
> And from 16.2.7:
> --------------------------------------------------------------------------------
> 2022-03-19T14:09:48.739-0500 7ff5f1209700  0 starting mon.olaxps-ceph90 rank 3 at public addrs [v2:10.5.240.81:3300/0,v1:10.5.240.81:6789/0] at bind addrs [v2:10.5.240.81:3300/0,v1:10.5.240.81:6789/0] mon_data /var/lib/ceph/mon/ceph-olaxps-ceph90 fsid a7fcde57-88df-4f14-a290-d170f0bedb25
> 2022-03-19T14:09:48.741-0500 7ff5f1209700  1 mon.olaxps-ceph90@-1(???) e24 preinit fsid a7fcde57-88df-4f14-a290-d170f0bedb25
> 2022-03-19T14:09:48.741-0500 7ff5f1209700 -1 mon.olaxps-ceph90@-1(???).mds e0 unable to decode FSMap: void FSMap::decode(ceph::buffer::v15_2_0::list::const_iterator&) no longer understand old encoding version v < 7: Malformed input
> terminate called after throwing an instance of 'ceph::buffer::v15_2_0::malformed_input'
>   what():  void FSMap::decode(ceph::buffer::v15_2_0::list::const_iterator&) no longer understand old encoding version v < 7: Malformed input
> *** Caught signal (Aborted) **
> in thread 7ff5f1209700 thread_name:ceph-mon
> ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
> 1: /lib64/libpthread.so.0(+0x12c20) [0x7ff5e60c1c20]
> 2: gsignal()
> 3: abort()
> 4: /lib64/libstdc++.so.6(+0x9009b) [0x7ff5e56d809b]
> 5: /lib64/libstdc++.so.6(+0x9653c) [0x7ff5e56de53c]
> 6: /lib64/libstdc++.so.6(+0x96597) [0x7ff5e56de597]
> 7: __cxa_rethrow()
> 8: /usr/bin/ceph-mon(+0x23256a) [0x55fa726a356a]
> 9: (PaxosService::refresh(bool*)+0x10e) [0x55fa7286e29e]
> 10: (Monitor::refresh_from_paxos(bool*)+0x18c) [0x55fa7271f2dc]
> 11: (Monitor::init_paxos()+0x10c) [0x55fa7271f5ec]
> 12: (Monitor::preinit()+0xd30) [0x55fa7274caa0]
> 13: main()
> 14: __libc_start_main()
> 15: _start()
> 2022-03-19T14:09:48.742-0500 7ff5f1209700 -1 *** Caught signal (Aborted) **
> in thread 7ff5f1209700 thread_name:ceph-mon
>
> ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
> 1: /lib64/libpthread.so.0(+0x12c20) [0x7ff5e60c1c20]
> 2: gsignal()
> 3: abort()
> 4: /lib64/libstdc++.so.6(+0x9009b) [0x7ff5e56d809b]
> 5: /lib64/libstdc++.so.6(+0x9653c) [0x7ff5e56de53c]
> 6: /lib64/libstdc++.so.6(+0x96597) [0x7ff5e56de597]
> 7: __cxa_rethrow()
> 8: /usr/bin/ceph-mon(+0x23256a) [0x55fa726a356a]
> 9: (PaxosService::refresh(bool*)+0x10e) [0x55fa7286e29e]
> 10: (Monitor::refresh_from_paxos(bool*)+0x18c) [0x55fa7271f2dc]
> 11: (Monitor::init_paxos()+0x10c) [0x55fa7271f5ec]
> 12: (Monitor::preinit()+0xd30) [0x55fa7274caa0]
> 13: main()
> 14: __libc_start_main()
> 15: _start()
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> --------------------------------------------------------------------------------
>
> Both versions seem to crash in FSMap::decode(), though the message from 16.2.7 is a little more verbose.  The stack trace looks different from https://tracker.ceph.com/issues/52820, though the "malformed input" message is the same.  I found the recent reports of the sanity checking bug in 16.2.7 (https://tracker.ceph.com/issues/54161 and https://github.com/ceph/ceph/pull/44910) but this looks like a different problem.  Just to be sure, I recompiled 16.2.7 from the SRPM with the patches from that PR applied.  They didn't help; it still crashes with the same error.
>
> This may be unrelated, but I've also tried adding a new monitor to the cluster running Octopus or Pacific -- I figured replacing the existing monitors would be just as good as upgrading.  I have tried Octopus 15.2.16, Pacific 16.2.5 and Pacific 16.2.7 without success.  Each version produces the same behavior: the existing monitors start using between 80% and 350% CPU (they run on 4 CPU VMs) and their memory usage climbs out of control until they crash (their containers are limited to 12 GB RAM; they normally use less than 1 GB).  While this is happening, the cluster basically freezes -- clients cannot connect, "ceph status" times out, etc.  The logs from the existing monitors are filled with tens of millions of lines like these:
> --------------------------------------------------------------------------------
> 2022-03-19 16:05:19.854 7f426c214700  1 mon.olaxps-cephmon22@2(peon) e17  adding peer [v2:10.5.240.81:3300/0,v1:10.5.240.81:6789/0] to list of hints
> 2022-03-19 16:05:19.854 7f426c214700  1 mon.olaxps-cephmon22@2(peon) e17  adding peer [v2:10.5.240.81:3300/0,v1:10.5.240.81:6789/0] to list of hints
> 2022-03-19 16:05:19.854 7f426c214700  1 mon.olaxps-cephmon22@2(peon) e17  adding peer [v2:10.5.240.81:3300/0,v1:10.5.240.81:6789/0] to list of hints
> 2022-03-19 16:05:19.854 7f426c214700  1 mon.olaxps-cephmon22@2(peon) e17  adding peer [v2:10.5.240.81:3300/0,v1:10.5.240.81:6789/0] to list of hints
> 2022-03-19 16:05:19.854 7f426c214700  1 mon.olaxps-cephmon22@2(peon) e17  adding peer [v2:10.5.240.81:3300/0,v1:10.5.240.81:6789/0] to list of hints
> --------------------------------------------------------------------------------
> The new monitor also uses high CPU and memory but doesn't spam its logs.  It never joins the cluster and doesn't write much to disk, even after waiting almost an hour.  After reading https://www.mail-archive.com/ceph-users@xxxxxxx/msg12031.html, I added the option "mon_sync_max_payload_size = 4096" to ceph.conf on all monitors (and restarted), but it didn't help.  Killing the new monitor unfreezes the cluster and returns the existing monitors to their typical CPU usage.  They don't release their excess memory without being restarted.
>
> I was able to update a similar (but newer) test cluster to Pacific, so this smells like something specific to the data in this cluster.  What else can I do to troubleshoot?  I can provide more output and config files if those would help; I didn't want to post a bunch of huge files if they aren't relevant.  Any suggestions?
>
> -- Sam Clippinger
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx