On Fri, Jan 6, 2012 at 4:36 AM, Karoly Horvath <rhswdev@xxxxxxxxx> wrote:
> Hi,
>
> no, this is a different problem, this time the failover was successful.

Aha, a different problem indeed!
I assume that mon.1 was located on beta (the computer you killed)? It
turns out that the standby MDS was connected to the monitor you
killed, and the MDS took a long time to time out its connection so it
didn't find out it needed to go active until after the default
connection timeout period had elapsed.
I've created a bug to track this issue:
http://tracker.newdream.net/issues/1912 (I will push a fix for it to
master tonight or tomorrow). In the meantime you can work around it by
only running one monitor and not killing the node it's on; if you can
attempt to reproduce the previous issue that is a more interesting
one! :)
-Greg

> 2012-01-05 12:49:48.185815 mds e376: 1/1/1 up {0=beta=up:active}, 1 up:standby
> 2012-01-05 12:50:32.200055 mds e377: 1/1/1 up
> {0=alpha=up:replay(laggy or crashed)}
> 2012-01-05 13:05:09.800119 7fd192bfa700 mds.0.55 waiting for osdmap
> 568 (which blacklists prior instance)
> 2012-01-05 13:06:07.851253 7fd192bfa700 mds.0.55 request_state up:active
> 2012-01-05 13:06:07.851259 7fd192bfa700 mds.0.55 beacon_send up:active
> seq 526 (currently up:rejoin)
>
> It took 15 minutes to get to the point where it prints "waiting for
> osdmap". After that it was quite fast. I hope someone will find the
> problem...
>
> --
> Karoly Horvath
> rhswdev@xxxxxxxxx
>
>
> On Thu, Jan 5, 2012 at 7:06 PM, Gregory Farnum
> <gregory.farnum@xxxxxxxxxxxxx> wrote:
>> On Thu, Jan 5, 2012 at 5:24 AM, Karoly Horvath <rhswdev@xxxxxxxxx> wrote:
>>> Hi,
>>>
>>> back from holiday.
>>>
>>> I did a successful power unplug test now, but the FS was unavailable
>>> for 16 minutes which is clearly wrong...
>>>
>>> I have the log files but the MDS log is 1.2 gigabyte, if you let me
>>> know which lines to filter / filter out I will upload it somewhere...
>>>
>>> --
>>> Karoly Horvath
>>
>> Assuming it's the same error as last time, the log will have a line
>> that contains "waiting for osdmap n (which blacklists prior
>> instance)", where n is an epoch number.
>>
>> Then at some later point there will be a line that looks something
>> like the following:
>> "2011-12-21 13:45:17.594746 7f4885307700 -- xxx.xxx.xxx.31:6800/4438
>> <== mon.2 xxx.xxx.xxx.35:6789/0 9 ==== osd_map(y..z src has 1..495) v2
>> ==== 748+0+0 (656995691 0 0) 0x1637400 con 0x163c000"
>> Where y and z are an interval which contains n. (In the previous log,
>> and probably here too, y=z=n.) I'm going to be interested in those two
>> lines and the stuff following when the osdmap arrives. Probably I will
>> only care about "objecter" lines, but it might be all of them...try
>> trimming off the minute following that osdmap line; it'll probably
>> contain more than I care about. :)
>> -Greg
>>
>>
>>> On Fri, Dec 23, 2011 at 12:00 AM, Gregory Farnum
>>> <gregory.farnum@xxxxxxxxxxxxx> wrote:
>>>> On Wed, Dec 21, 2011 at 8:43 AM, Karoly Horvath <rhswdev@xxxxxxxxx> wrote:
>>>>> On Wed, Dec 21, 2011 at 4:13 PM, Gregory Farnum
>>>>>>> By client I assume you mean the kernel driver.. the FS is freezed, so
>>>>>>> I cannot unmount (cannot even `shutdown`).. how can I force the client
>>>>>>> to reconnect?
>>>>>>
>>>>>> Try a lazy force unmount:
>>>>>> umount -lf ceph_mnt_point/
>>>>>> And then mount again.
>>>>>
>>>>> wow, never heard about this, thanks.:)
>>>>> will report with the next mail
>>>>>
>>>>> In the meantime I did one test, killing mds+osd+mon on beta,
>>>>> it's jammed in '{0=alpha=up:replay}', after 45 minutes I shut it down...
>>>>> I attached the logs.
>>>>
>>>> Oh, this is very odd! The MDS goes to sleep while it waits for an
>>>> up-to-date OSDMap, but it never seems to get woken up even though I
>>>> see the message sending in the OSDMap.
>>>>
>>>> So let's try this one more time, but this time also add in "debug
>>>> objecter = 20" to the MDS config...Those logs will include everything
>>>> I need, or nothing will, promise! :)
>>>> -Greg
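
The log-trimming steps described in the thread (find the "waiting for
osdmap n" line, then the later osd_map(y..z ...) line whose interval
contains n, then keep roughly the minute that follows) could be sketched
as a small Python filter. This is only a minimal sketch, not a Ceph tool:
the function names and the 60-second default window are assumptions made
for illustration, and the regular expressions assume log lines shaped
like the samples quoted in this thread.

```python
import re
from datetime import datetime, timedelta

# Patterns are assumptions based on the sample log lines quoted in this thread.
WAITING_RE = re.compile(
    r"waiting for osdmap (\d+) \(which blacklists prior instance\)"
)
OSDMAP_RE = re.compile(r"osd_map\((\d+)\.\.(\d+) src has")
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:26], TS_FORMAT)
    except ValueError:
        return None


def trim_mds_log(lines, window_seconds=60):
    """Yield the 'waiting for osdmap n' line, the first later osd_map(y..z)
    line with y <= n <= z, and every line within window_seconds after it."""
    epoch = None      # n, once the "waiting for osdmap" line has been seen
    deadline = None   # end of the window, once the osd_map line has been seen
    for line in lines:
        if deadline is not None:
            ts = parse_ts(line)
            if ts is not None and ts > deadline:
                break
            yield line
            continue
        if epoch is None:
            m = WAITING_RE.search(line)
            if m:
                epoch = int(m.group(1))
                yield line
            continue
        m = OSDMAP_RE.search(line)
        if m and int(m.group(1)) <= epoch <= int(m.group(2)):
            ts = parse_ts(line)
            if ts is not None:
                deadline = ts + timedelta(seconds=window_seconds)
                yield line
```

Because the function is a generator over any iterable of lines, a 1.2 GB
log can be streamed through it from an open file object without loading
the whole file into memory. To keep only "objecter" lines, the output
can be filtered further afterwards.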