Hi Patrick,

Thanks a lot for letting us know about this issue!

After reading your fix [1] carefully, I understand the heart of the issue to be this: since Jewel, CephFS has used a new data structure, FSMap (for MultiFS), and the monitor has been using it as the Paxos value. Before Jewel, the structure stored in the monitor DB was MDSMap, and the initial MDSMap stays in the DB and never gets trimmed if CephFS was never used. Starting with Pacific, the monitor no longer expects the MDSMap structure in the DB, which causes the crash.

To detect whether any old MDSMap exists, we just need to get the oldest mdsmap from the monitor DB and try to decode it with the Pacific ceph-dencoder. We can do the following:

1. Stop one monitor (since this has to be done during the upgrade).
2. Export the binary of the first committed mdsmap from the monitor DB (ceph-kvstore-tool can do this).
3. Feed the binary to the Pacific version of ceph-dencoder.
4. If the binary can be decoded, we can be sure there is no legacy data structure. Otherwise, there is a legacy data structure, and the upgrade needs a short stop at the just-released Octopus v15.2.14 before continuing to Pacific.

I've done some testing and it worked. Below is the same crash stack, produced when I use the Pacific ceph-dencoder to decode the mdsmap from a cluster (without CephFS) that was upgraded from Firefly.
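For reference, steps 2 to 4 above could be scripted roughly as follows. This is only a sketch under assumptions: the function names are made up for illustration, the mon store is assumed to be RocksDB-backed, and the mdsmap is assumed to be stored under the "mdsmap" prefix with the version number as the key (please check this against your own store with "ceph-kvstore-tool ... list mdsmap" first). The ceph-dencoder arguments are the same ones used in the test shown after the sketch.

```shell
#!/bin/sh
# Sketch only. Tool names can be overridden, e.g. to point at Pacific binaries.
KVSTORE_TOOL="${KVSTORE_TOOL:-ceph-kvstore-tool}"
DENCODER="${DENCODER:-ceph-dencoder}"

# Hypothetical helper: export one mdsmap version from a stopped monitor's store.
# Assumes a RocksDB store and the key layout "mdsmap/<version>".
export_mdsmap_blob() {
    store_path="$1"   # e.g. the mon's store.db directory
    version="$2"      # first committed mdsmap version
    out_file="$3"
    "$KVSTORE_TOOL" rocksdb "$store_path" get mdsmap "$version" out "$out_file"
}

# Hypothetical helper: try to decode the blob as FSMap.
# Returns 0 if it decodes, 1 if the decoder fails or aborts.
check_mdsmap_blob() {
    blob="$1"
    if "$DENCODER" import "$blob" type FSMap decode dump_json >/dev/null 2>&1; then
        echo "OK: $blob decodes as FSMap, no legacy MDSMap detected"
        return 0
    else
        echo "LEGACY: $blob does not decode as FSMap, stop at v15.2.14 first"
        return 1
    fi
}
```

With the monitor stopped, one would call export_mdsmap_blob with the store path, the first committed version and an output file, and then pass that file to check_mdsmap_blob. An abort in the decoder shows up as a non-zero exit status, which is why the check treats any failure as "legacy".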
~# ceph-dencoder import mdsmap.1.f2j type FSMap decode dump_json
/build/ceph-dJyyVB/ceph-16.2.0/src/mds/FSMap.cc: In function 'void FSMap::decode(ceph::buffer::v15_2_0::list::const_iterator&)' thread 7fda1b03a240 time 2021-08-08T04:27:57.491978+0000
/build/ceph-dJyyVB/ceph-16.2.0/src/mds/FSMap.cc: 648: ceph_abort_msg("abort() called")
 ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe0) [0x7fda1e3a652d]
 2: (*FSMap::decode*(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0xdca) [0x7fda1e9535aa]
 3: (DencoderBase<FSMap>::decode[abi:cxx11](ceph::buffer::v15_2_0::list, unsigned long)+0x54) [0x55b3a5e6ed84]
 4: main()
 5: __libc_start_main()
 6: _start()
Aborted (core dumped)

Basically, the steps above follow the same workflow the monitor uses to load the mdsmap from the DB and decode it.

[1] https://github.com/ceph/ceph/pull/42349

Best Regards,
Dongdong

Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote on Sat, Aug 7, 2021 at 4:28 AM:

> Hello Linh,
>
> On Thu, Aug 5, 2021 at 9:12 PM Linh Vu <linh.vu@xxxxxxxxxxxxxxxxx> wrote:
> > Without personally knowing the history of a cluster, is there a way to
> > check and see when and which release it began life as? Or check whether
> > such legacy data structures still exist in the mons?
>
> I'm not aware of an easy way to check the release a cluster started
> as. And unfortunately, there is no way to check for legacy data
> structures. If your cluster has used CephFS at all since Jewel, it's
> very unlikely there will be any in the mon stores. If you're not sure,
> best to upgrade through v15.2.14 to be safe.
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Principal Software Engineer
> Red Hat Sunnyvale, CA
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D