Re: Luminous OSD startup errors

Jason Dillaman <jdillama@xxxxxxxxxx> · Tue, 15 Aug 2017 09:19:18 -0400

I believe this is a known issue [1], and a new 12.1.4 RC may be
released because of it. The tracker ticket links to a set of
development packages that should resolve the issue in the meantime.

[1] http://tracker.ceph.com/issues/20985
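
For anyone following along, here is a rough standalone sketch of the
failure mode, based only on the snippet quoted in the original message.
It is not Ceph code: scan_keys and ReadResult are hypothetical names, the
omap is modelled as a plain std::map, and the assert is replaced by an
exception so the sketch can be exercised without aborting.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for what a scan of a PG's log keys reports.
struct ReadResult {
  bool saw_divergent_priors = false;
  bool rebuild_missing = false;
};

// Simplified model of the 12.1.3 check: a "divergent_priors" key is only
// tolerated when the caller asked for the missing set to be rebuilt.
// Throws instead of calling assert() so the behaviour can be observed.
ReadResult scan_keys(const std::map<std::string, std::string>& omap,
                     bool force_rebuild_missing) {
  ReadResult r;
  for (const auto& kv : omap) {
    if (kv.first == "divergent_priors") {
      if (!force_rebuild_missing) {
        throw std::logic_error("FAILED assert(force_rebuild_missing)");
      }
      r.saw_divergent_priors = true;
      r.rebuild_missing = true;
    }
  }
  return r;
}
```

In this model, a store that still carries a "divergent_priors" key trips
the check whenever the caller passes force_rebuild_missing = false, which
matches the symptom in the report. Why the flag is not set on the real
code path is a question for the tracker ticket, not this sketch.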

On Tue, Aug 15, 2017 at 9:08 AM, Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> After upgrading to the latest Luminous RC (12.1.3), all our OSDs are
> crashing with the following assert:
>
>      0> 2017-08-15 08:28:49.479238 7f9b7615cd00 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h:
> In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t,
> coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&,
> bool, std::ostringstream&, bool, bool*, const DoutPrefixProvider*,
> std::set<std::basic_string<char> >*, bool) [with missing_type =
> pg_missing_set<true>; std::ostringstream = std::basic_ostringstream<char>]'
> thread 7f9b7615cd00 time 2017-08-15 08:28:49.477367
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h:
> 1301: FAILED assert(force_rebuild_missing)
>
>  ceph version 12.1.3 (c56d9c07b342c08419bbc18dcf2a4c5fae62b9cf) luminous
> (rc)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x110) [0x55d0f2be3b50]
>  2: (void PGLog::read_log_and_missing<pg_missing_set<true> >(ObjectStore*,
> coll_t, coll_t, ghobject_t, pg_info_t const&, PGLog::IndexedLog&,
> pg_missing_set<true>&, bool, std::basic_ostringstream<char,
> std::char_traits<char>, std::allocator<char> >&, bool, bool*,
> DoutPrefixProvider const*, std::set<std::string, std::less<std::string>,
> std::allocator<std::string> >*, bool)+0x773) [0x55d0f276f013]
>  3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x52b)
> [0x55d0f272739b]
>  4: (OSD::load_pgs()+0x97a) [0x55d0f2673dea]
>  5: (OSD::init()+0x2179) [0x55d0f268c319]
>  6: (main()+0x2def) [0x55d0f2591ccf]
>  7: (__libc_start_main()+0xf5) [0x7f9b727d6b35]
>  8: (()+0x4ac006) [0x55d0f2630006]
>
> Looking at the code in PGLog.h, the change from 12.1.2 to 12.1.3 (in
> read_log_and_missing) was:
>
>         if (p->key() == "divergent_priors") {
>           ::decode(divergent_priors, bp);
>           ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size()
>                              << " divergent_priors" << dendl;
>           has_divergent_priors = true;
>           debug_verify_stored_missing = false;
>
> to
>
>         if (p->key() == "divergent_priors") {
>           ::decode(divergent_priors, bp);
>           ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size()
>                              << " divergent_priors" << dendl;
>           assert(force_rebuild_missing);
>           debug_verify_stored_missing = false;
>
> and it seems like force_rebuild_missing is not being set.
>
> This cluster was upgraded from Jewel to 12.1.1, then 12.1.2 and now 12.1.3.
> So it seems something didn't happen correctly during the upgrade.  Any ideas
> how to fix it?
>
> Andras
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 
Jason