After upgrading to the latest Luminous RC (12.1.3), all our OSDs are
crashing with the following assert:
0> 2017-08-15 08:28:49.479238 7f9b7615cd00 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h:
In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t,
coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&,
bool, std::ostringstream&, bool, bool*, const DoutPrefixProvider*,
std::set<std::basic_string<char> >*, bool) [with missing_type =
pg_missing_set<true>; std::ostringstream = std::basic_ostringstream<char>]'
thread 7f9b7615cd00 time 2017-08-15 08:28:49.477367
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h:
1301: FAILED assert(force_rebuild_missing)
ceph version 12.1.3 (c56d9c07b342c08419bbc18dcf2a4c5fae62b9cf) luminous
(rc)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x110) [0x55d0f2be3b50]
2: (void PGLog::read_log_and_missing<pg_missing_set<true> >(ObjectStore*,
coll_t, coll_t, ghobject_t, pg_info_t const&, PGLog::IndexedLog&,
pg_missing_set<true>&, bool, std::basic_ostringstream<char,
std::char_traits<char>, std::allocator<char> >&, bool, bool*,
DoutPrefixProvider const*, std::set<std::string, std::less<std::string>,
std::allocator<std::string> >*, bool)+0x773) [0x55d0f276f013]
3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x52b)
[0x55d0f272739b]
4: (OSD::load_pgs()+0x97a) [0x55d0f2673dea]
5: (OSD::init()+0x2179) [0x55d0f268c319]
6: (main()+0x2def) [0x55d0f2591ccf]
7: (__libc_start_main()+0xf5) [0x7f9b727d6b35]
8: (()+0x4ac006) [0x55d0f2630006]
Looking at the code in PGLog.h, the change from 12.1.2 to 12.1.3 (in
read_log_and_missing) was from:
    if (p->key() == "divergent_priors") {
      ::decode(divergent_priors, bp);
      ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size()
                         << " divergent_priors" << dendl;
      has_divergent_priors = true;
      debug_verify_stored_missing = false;
to
    if (p->key() == "divergent_priors") {
      ::decode(divergent_priors, bp);
      ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size()
                         << " divergent_priors" << dendl;
      assert(force_rebuild_missing);
      debug_verify_stored_missing = false;
and it seems like force_rebuild_missing is not being set.
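
To make the suspected failure condition concrete, here is a minimal
stand-alone sketch. It is not Ceph code: would_trip_assert and its
parameters are hypothetical names made up for illustration, and the only
thing taken from PGLog.h is the "divergent_priors" key and the
force_rebuild_missing flag. It models my reading of the excerpt above,
namely that 12.1.3 aborts whenever that key is read while the flag is
false.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for the check shown in the 12.1.3 excerpt.
// omap_keys models the keys seen while iterating the PG log;
// force_rebuild_missing models the flag passed in by the caller.
// Returns true when assert(force_rebuild_missing) would fire.
bool would_trip_assert(const std::set<std::string>& omap_keys,
                       bool force_rebuild_missing) {
  // The assert is only reached when a "divergent_priors" key is present.
  const bool has_divergent_priors =
      omap_keys.count("divergent_priors") > 0;
  return has_divergent_priors && !force_rebuild_missing;
}
```

Under this model, a PG that still carries a "divergent_priors" key and is
loaded with the flag unset would abort, which matches the backtrace above.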
This cluster was upgraded from Jewel to 12.1.1, then to 12.1.2, and now to
12.1.3, so it seems some step did not complete correctly during one of the
upgrades. Any ideas how to fix it?
Andras