I believe this is a known issue [1] and that there will potentially be a
new 12.1.4 RC released because of it. The tracker ticket has a link to a
set of development packages that should resolve the issue in the
meantime.

[1] http://tracker.ceph.com/issues/20985

On Tue, Aug 15, 2017 at 9:08 AM, Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> After upgrading to the latest Luminous RC (12.1.3), all our OSD's are
> crashing with the following assert:
>
>      0> 2017-08-15 08:28:49.479238 7f9b7615cd00 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h:
> In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t,
> coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&,
> bool, std::ostringstream&, bool, bool*, const DoutPrefixProvider*,
> std::set<std::basic_string<char> >*, bool) [with missing_type =
> pg_missing_set<true>; std::ostringstream = std::basic_ostringstream<char>]'
> thread 7f9b7615cd00 time 2017-08-15 08:28:49.477367
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h:
> 1301: FAILED assert(force_rebuild_missing)
>
>  ceph version 12.1.3 (c56d9c07b342c08419bbc18dcf2a4c5fae62b9cf) luminous
> (rc)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x110) [0x55d0f2be3b50]
>  2: (void PGLog::read_log_and_missing<pg_missing_set<true> >(ObjectStore*,
> coll_t, coll_t, ghobject_t, pg_info_t const&, PGLog::IndexedLog&,
> pg_missing_set<true>&, bool, std::basic_ostringstream<char,
> std::char_traits<char>, std::allocator<char> >&, bool, bool*,
> DoutPrefixProvider const*, std::set<std::string, std::less<std::string>,
> std::allocator<std::string> >*, bool)+0x773) [0x55d0f276f013]
>  3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x52b)
> [0x55d0f272739b]
>  4: (OSD::load_pgs()+0x97a) [0x55d0f2673dea]
>  5: (OSD::init()+0x2179) [0x55d0f268c319]
>  6: (main()+0x2def) [0x55d0f2591ccf]
>  7: (__libc_start_main()+0xf5) [0x7f9b727d6b35]
>  8: (()+0x4ac006) [0x55d0f2630006]
>
> Looking at the code in PGLog.h, the change from 12.1.2 to 12.1.3 (in
> read_log_missing) was:
>
>     if (p->key() == "divergent_priors") {
>       ::decode(divergent_priors, bp);
>       ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size()
>                          << " divergent_priors" << dendl;
>       has_divergent_priors = true;
>       debug_verify_stored_missing = false;
>
> to
>
>     if (p->key() == "divergent_priors") {
>       ::decode(divergent_priors, bp);
>       ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size()
>                          << " divergent_priors" << dendl;
>       assert(force_rebuild_missing);
>       debug_verify_stored_missing = false;
>
> and it seems like force_rebuild_missing is not being set.
>
> This cluster was upgraded from Jewel to 12.1.1, then 12.1.2 and now 12.1.3.
> So it seems something didn't happen correctly during the upgrade. Any ideas
> how to fix it?
>
> Andras
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
Jason