Re: 10.2.0 - mds won't recover, waiting on journal 300

On Sun, May 1, 2016 at 2:34 AM, Russ <wernerru@xxxxxxx> wrote:
> After getting all the OSDs and MONs updated and running OK, I updated the
> MDS as usual and rebooted the machine after updating the kernel (we're on
> 14.04, but it was running an older 4.x kernel, so I took it to 16.04's
> version). After the reboot, the MDS fails to come up. No replay, no nothing.
>
> It boots normally, then stalls while waiting for the journal to recover,
> just repeating these beacon messages:
>
> 2016-04-30 21:21:33.889536 7f9f85da3700 10 mds.beacon.a _send up:replay seq
> 59
> 2016-04-30 21:21:33.889576 7f9f85da3700  1 -- 35.8.224.77:6800/31903 -->
> 35.8.224.132:6789/0 -- mdsbeacon(15227404/a up:replay seq 59 v6030) v6 --
> ?+0 0x55a7d0a72000 con 0x55a7d0934600
> 2016-04-30 21:21:33.890646 7f9f88eaa700  1 -- 35.8.224.77:6800/31903 <==
> mon.1 35.8.224.132:6789/0 70 ==== mdsbeacon(15227404/a up:replay seq 59
> v6030) v6 ==== 125+0+0 (945447566 0 0) 0x55a7d0a74700 con 0x55a7d0934600
> 2016-04-30 21:21:33.890693 7f9f88eaa700 10 mds.beacon.a handle_mds_beacon
> up:replay seq 59 rtt 0.001135
>
> The journal never does anything, but upon killing the PID, it shows:
>
> 2016-04-30 21:21:40.455902 7f9f83b9d700  4 mds.0.log Journal 300 recovered.
> 2016-04-30 21:21:40.455929 7f9f83b9d700  0 mds.0.log Journal 300 is in
> unknown format 4294967295, does this MDS daemon require upgrade?

Hmm, I think this message is misleading: Journaler::shutdown completes any
contexts waiting on journal recovery with status 0, which causes
MDLog::_recovery_thread to proceed even though the journaler header was
never populated properly (hence the bogus version).
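
For what it's worth, 4294967295 is just 0xFFFFFFFF, i.e. a -1 sentinel read
back through an unsigned 32-bit field, which is what you'd expect from a
header whose format field was never filled in. A minimal illustration of
that arithmetic (not the actual Journaler code):

    #include <cstdint>
    #include <cstdio>

    int main() {
        // A "not yet populated" sentinel of -1 stored in an
        // unsigned 32-bit format field...
        uint32_t stream_format = static_cast<uint32_t>(-1);

        // ...reads back as 4294967295, matching the "unknown
        // format" value in the log above.
        std::printf("unknown format %u\n", stream_format);
        return 0;
    }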

It seems likely that there is a problem with your RADOS cluster that is
causing the MDS's read operations to stall while it tries to read its
journal.  You could confirm this by starting the MDS and then, while it is
stuck, running "ceph daemon mds.<name> objecter_requests" on the MDS node
to see what its outstanding operations are.
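
For example, on the node where the stuck MDS is running (if your admin
socket path is non-default you may need to point --admin-daemon at it
explicitly):

    ceph daemon mds.<name> objecter_requests

If that shows reads hanging against particular OSDs or PGs, cross-check the
cluster side with something like:

    ceph -s
    ceph health detail

If it does come to nuking the journal, the usual route is
cephfs-journal-tool, taking a backup first (backup.bin below is just an
example output path), roughly:

    cephfs-journal-tool journal export backup.bin
    cephfs-journal-tool event recover_dentries summary
    cephfs-journal-tool journal reset

But note those tools read the same journal objects from RADOS, so if the
underlying reads are hanging they will hang too: sort out the unfound
objects first.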

John

>
> The only reason the MDS got rebooted fully after the upgrades was that
> some random objects were showing as unfound, yet if I shut down one of
> the nodes housing those OSDs, the unfound count would drop. Obviously we
> need to deal with the MDS issue first, haha.
>
> Hopefully someone has some insight as to what can be run to either get it
> back online as-is, nuke the journal (the on-disk metadata should be OK;
> there wasn't any traffic of importance happening during the upgrades), or
> reset it so it'll pull from the metadata pool.
>
> Thanks!
>
> Russ
> CAL Tech Lead
>
> Russell Werner
> wernerru@xxxxxxx
> O: 517.884.1504 - Direct
> C: 517.803.8488 - 24/7
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


