Re: Luminous OSD startup errors

Thanks for the quick response and the pointer. The dev build fixed the issue.

Andras


On 08/15/2017 09:19 AM, Jason Dillaman wrote:
I believe this is a known issue [1] and that there will potentially be
a new 12.1.4 RC released because of it. The tracker ticket has a link
to a set of development packages that should resolve the issue in the
meantime.


[1] http://tracker.ceph.com/issues/20985

On Tue, Aug 15, 2017 at 9:08 AM, Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
After upgrading to the latest Luminous RC (12.1.3), all our OSDs are
crashing with the following assert:

      0> 2017-08-15 08:28:49.479238 7f9b7615cd00 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h:
In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t,
coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&,
bool, std::ostringstream&, bool, bool*, const DoutPrefixProvider*,
std::set<std::basic_string<char> >*, bool) [with missing_type =
pg_missing_set<true>; std::ostringstream = std::basic_ostringstream<char>]'
thread 7f9b7615cd00 time 2017-08-15 08:28:49.477367
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h:
1301: FAILED assert(force_rebuild_missing)

  ceph version 12.1.3 (c56d9c07b342c08419bbc18dcf2a4c5fae62b9cf) luminous
(rc)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x110) [0x55d0f2be3b50]
  2: (void PGLog::read_log_and_missing<pg_missing_set<true> >(ObjectStore*,
coll_t, coll_t, ghobject_t, pg_info_t const&, PGLog::IndexedLog&,
pg_missing_set<true>&, bool, std::basic_ostringstream<char,
std::char_traits<char>, std::allocator<char> >&, bool, bool*,
DoutPrefixProvider const*, std::set<std::string, std::less<std::string>,
std::allocator<std::string> >*, bool)+0x773) [0x55d0f276f013]
  3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x52b)
[0x55d0f272739b]
  4: (OSD::load_pgs()+0x97a) [0x55d0f2673dea]
  5: (OSD::init()+0x2179) [0x55d0f268c319]
  6: (main()+0x2def) [0x55d0f2591ccf]
  7: (__libc_start_main()+0xf5) [0x7f9b727d6b35]
  8: (()+0x4ac006) [0x55d0f2630006]

Looking at the code in PGLog.h, the change from 12.1.2 to 12.1.3 (in
read_log_and_missing) was:

         if (p->key() == "divergent_priors") {
           ::decode(divergent_priors, bp);
           ldpp_dout(dpp, 20) << "read_log_and_missing "
                              << divergent_priors.size()
                              << " divergent_priors" << dendl;
           has_divergent_priors = true;
           debug_verify_stored_missing = false;

to

         if (p->key() == "divergent_priors") {
           ::decode(divergent_priors, bp);
           ldpp_dout(dpp, 20) << "read_log_and_missing "
                              << divergent_priors.size()
                              << " divergent_priors" << dendl;
           assert(force_rebuild_missing);
           debug_verify_stored_missing = false;

and it seems like force_rebuild_missing is not being set.
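
To make the difference between the two snippets above concrete, here is a
minimal standalone sketch. It is not the Ceph code: the function names
(scan_keys_old, scan_keys_new), the ScanResult enum and the plain vector of
key names are hypothetical stand-ins for the real key iteration, and the
assert is modelled as a return value so the outcome can be inspected instead
of aborting the process.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified model of the key scan in read_log_and_missing.
// The real function decodes on-disk entries; this only mirrors the branch
// on the "divergent_priors" key shown in the two snippets.
enum class ScanResult { Ok, AssertWouldFail };

// 12.1.2-style branch: a "divergent_priors" key is tolerated and only
// recorded in has_divergent_priors.
ScanResult scan_keys_old(const std::vector<std::string>& keys,
                         bool& has_divergent_priors) {
  has_divergent_priors = false;
  for (const auto& key : keys) {
    if (key == "divergent_priors") {
      has_divergent_priors = true;
    }
  }
  return ScanResult::Ok;
}

// 12.1.3-style branch: a "divergent_priors" key is only acceptable when the
// caller passed force_rebuild_missing = true; otherwise the assert fires.
ScanResult scan_keys_new(const std::vector<std::string>& keys,
                         bool force_rebuild_missing) {
  for (const auto& key : keys) {
    if (key == "divergent_priors" && !force_rebuild_missing) {
      return ScanResult::AssertWouldFail;  // stands in for assert(...)
    }
  }
  return ScanResult::Ok;
}
```

Under these assumptions, a PG that still carries a "divergent_priors" key
passes the old check but trips the new one whenever force_rebuild_missing is
false, which matches the crash in the backtrace.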

This cluster was upgraded from Jewel to 12.1.1, then 12.1.2 and now 12.1.3.
So it seems something didn't happen correctly during the upgrade.  Any ideas
how to fix it?

Andras


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



