Re: upgraded to Ubuntu 16.04, getting assert failure

On Sun, Apr 10, 2016 at 4:12 AM, Don Waterloo <don.waterloo@xxxxxxxxx> wrote:
> I have a 6 OSD system (with 3 mon and 3 mds).
> It is running CephFS as part of its task.
>
> I have upgraded the 3 mon nodes to Ubuntu 16.04 and the bundled ceph
> 10.1.0-0ubuntu1.
>
> (upgraded from Ubuntu 15.10 with ceph 0.94.6-0ubuntu0.15.10.1).
>
> 2 of the mon nodes are happy and up, but the 3rd is giving an assert failure
> on start.
> Specifically, the assert is:
> mds/FSMap.cc: 555: FAILED assert(i.second.state == MDSMap::STATE_STANDBY)
>
> The 'ceph status' output shows 3 MDS (1 up:active, 2 up:standby):
>
> # ceph status
> 2016-04-10 03:08:24.522804 7f2be870c700  0 -- :/1760247070 >>
> 10.100.10.62:6789/0 pipe(0x7f2be405a2f0 sd=3 :0 s=1 pgs=0 cs=0 l=1
> c=0x7f2be405bf90).fault
>     cluster b23abffc-71c4-4464-9449-3f2c9fbe1ded
>      health HEALTH_WARN
>             crush map has legacy tunables (require bobtail, min is firefly)
>             1 mons down, quorum 0,1 nubo-1,nubo-2
>      monmap e1: 3 mons at
> {nubo-1=10.100.10.60:6789/0,nubo-2=10.100.10.61:6789/0,nubo-3=10.100.10.62:6789/0}
>             election epoch 2778, quorum 0,1 nubo-1,nubo-2
>      mdsmap e1279: 1/1/1 up {0:0=nubo-2=up:active}, 2 up:standby
>      osdmap e5666: 6 osds: 6 up, 6 in
>       pgmap v1476810: 712 pgs, 5 pools, 41976 MB data, 109 kobjects
>             86310 MB used, 5538 GB / 5622 GB avail
>                  712 active+clean
>
> I'm not sure what to do at this stage. I've rebooted all of them, and I've
> tried taking the 2 standby MDS down. I don't see why this mon fails when
> the others succeed.
>
> Does anyone have any suggestions?
>
> The stack trace from the assert gives:
>  1: (()+0x51fb9d) [0x5572d9e42b9d]
>  2: (()+0x113e0) [0x7fa285f8b3e0]
>  3: (gsignal()+0x38) [0x7fa28416b518]
>  4: (abort()+0x16a) [0x7fa28416d0ea]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x26b) [0x5572d9f7082b]
>  6: (FSMap::sanity() const+0x9ae) [0x5572d9e84f4e]
>  7: (MDSMonitor::update_from_paxos(bool*)+0x313) [0x5572d9c7e8f3]
>  8: (PaxosService::refresh(bool*)+0x3dd) [0x5572d9c012dd]
>  9: (Monitor::refresh_from_paxos(bool*)+0x193) [0x5572d9b99693]
>  10: (Monitor::init_paxos()+0x115) [0x5572d9b99ad5]
>  11: (Monitor::preinit()+0x902) [0x5572d9bca252]
>  12: (main()+0x255b) [0x5572d9b3ec9b]
>  13: (__libc_start_main()+0xf1) [0x7fa284156841]
>  14: (_start()+0x29) [0x5572d9b8b869]

Please provide the full log from the mon, from startup until it crashes,
with "debug mon = 10" set.

If the mons really are all running the same code but only one is
failing, then presumably that one somehow ended up storing something
invalid in its local store during the upgrade, while the others have
already proceeded past that version.

v10.1.1 (i.e. Jewel, when it is released) has a configuration option
(mon_mds_skip_sanity) that may allow you to get past this, assuming
what's in the leader's store is indeed valid (and it probably is, since
your other two mons are apparently happy).
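
In ceph.conf that would look roughly like this (a sketch, assuming the
option keeps the mon_mds_skip_sanity name in the released packages; set
it only on the broken mon and drop it again once that mon has caught up):

    [mon]
        mon mds skip sanity = true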

I don't know exactly how the Ubuntu release process works, but you
should be aware that the Ceph version you're running is pre-release
code from the jewel branch.

If your CephFS data pool happens to have ID 0, you will also hit a
severe bug in that code, and you should stop using it now (see the
note here: http://blog.gmane.org/gmane.comp.file-systems.ceph.announce)
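
A quick way to check the pool IDs (standard CLI commands; the exact
output format varies a little between releases):

    # lists every pool with its numeric ID
    ceph osd lspools
    # shows which pools the filesystem is using (jewel-era syntax)
    ceph fs ls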

John
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


