Re: mds suicide

Leander Yu <leander.yu@xxxxxxxxx> · Tue, 5 Oct 2010 13:28:08 +0800

I have another OSD that was marked down, although I can still reach the
machine over ssh and the cosd process is still running.
Its log shows the same pipe fault error:
192.168.1.9:6801/1537 >> 192.168.1.25:6801/29084 pipe(0x7f7b680e2620
sd=-1 pgs=437 cs=1 l=0).fault with nothing to send, going to standby

Are those two cases related?

Regards,
Leander Yu.
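
One rough way to compare the two cases is to pull the local and peer addresses out of the fault entries in each daemon's log and see whether the same hosts keep turning up. The sketch below only assumes the entry shape shown in the log excerpt above (`<local> >> <peer> pipe(...).fault`); `fault_pairs` is a hypothetical helper name, and a shared peer would be a hint to look at, not proof that the cases are related.

```python
import re
from typing import List, Tuple

# Addresses in these logs look like ip:port/nonce.
ADDR = r"\d+\.\d+\.\d+\.\d+:\d+/\d+"

# Matches "<local> >> <peer> pipe(...).fault".
FAULT_RE = re.compile(
    rf"({ADDR})\s*>>\s*({ADDR})\s+pipe\(.*?\)\.fault",
    re.DOTALL,  # tolerate entries wrapped across lines, as in this email
)


def fault_pairs(log_text: str) -> List[Tuple[str, str]]:
    """Return (local, peer) address pairs for every pipe fault entry found."""
    return FAULT_RE.findall(log_text)


# The OSD log entry quoted in this message, including its line wrap.
sample = (
    "192.168.1.9:6801/1537 >> 192.168.1.25:6801/29084 pipe(0x7f7b680e2620\n"
    "sd=-1 pgs=437 cs=1 l=0).fault with nothing to send, going to standby"
)
print(fault_pairs(sample))
```

Running the same function over the MDS log and the OSD log and comparing the peer addresses would show whether both daemons lost contact with the same machines at about the same time.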

On Tue, Oct 5, 2010 at 1:15 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Tue, 5 Oct 2010, Leander Yu wrote:
>> Hi Sage,
>> Thanks a lot for your prompt answer.
>> So is this behavior normal, assuming there was a network issue?
>> In that case, would it be better to restart the mds instead of having it
>> kill itself, or to leave it running as a standby?
>
> The mds has lots of internal state that would be tricky to clean up
> properly, so one way or another the old instance should die.
>
> But you're right: probably it should just respawn a new instance instead
> of exiting?  The new instance will come back up in standby mode. Maybe
> re-exec with the same set of arguments the original instance was executed
> with?
>
> sage
>
>
>>
>> Regards,
>> Leander Yu.
>>
>> On Tue, Oct 5, 2010 at 1:00 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> > On Mon, 4 Oct 2010, Sage Weil wrote:
>> >> On Tue, 5 Oct 2010, Leander Yu wrote:
>> >> > Hi,
>> >> > I have a 46-machine cluster (44 osd/mon + 2 mds) running ceph now. The MDS
>> >> > is running in active/standby mode.
>> >> > This morning one of the MDSs killed itself. The log shows:
>> >> >
>> >> > -------------------------------------------
>> >> > 2010-10-04 22:24:19.450022 7f2e5a1ee710 mds0.cache.ino(10000002b87)
>> >> > pop_projected_snaprealm 0x7f2e50cd9f70 seq1
>> >> > 2010-10-04 22:26:12.180854 7f2debbfb710 -- 192.168.1.103:6800/2081 >>
>> >> > 192.168.1.106:0/2453428678 pipe(0x7f2e380013d0 sd=-1 pgs=2 cs=1
>> >> > l=0).fault with nothing to send, going to standby
>> >> > 2010-10-04 22:26:12.181019 7f2e481dc710 -- 192.168.1.103:6800/2081 >>
>> >> > 192.168.1.111:0/18905730 pipe(0x7f2e38002250 sd=-1 pgs=2 cs=1
>> >> > l=0).fault with nothing to send, going to standby
>> >> > 2010-10-04 22:26:12.181041 7f2dc3fff710 -- 192.168.1.103:6800/2081 >>
>> >> > 192.168.1.114:0/1945631186 pipe(0x7f2e38000f00 sd=-1 pgs=2 cs=1
>> >> > l=0).fault with nothing to send, going to standby
>> >> > 2010-10-04 22:26:12.181149 7f2deaef6710 -- 192.168.1.103:6800/2081 >>
>> >> > 192.168.1.113:0/521184914 pipe(0x7f2e38002f90 sd=-1 pgs=2 cs=1
>> >> > l=0).fault with nothing to send, going to standby
>> >> > 2010-10-04 22:26:12.181563 7f2deb5f5710 -- 192.168.1.103:6800/2081 >>
>> >> > 192.168.1.112:0/4272114728 pipe(0x7f2e38002ac0 sd=-1 pgs=2 cs=1
>> >> > l=0).fault with nothing to send, going to standby
>> >> > 2010-10-04 22:26:13.777624 7f2e5a1ee710 mds-1.3 handle_mds_map i
>> >> > (192.168.1.103:6800/2081) dne in the mdsmap, killing myself
>> >> > 2010-10-04 22:26:13.777649 7f2e5a1ee710 mds-1.3 suicide.  wanted
>> >> > up:active, now down:dne
>> >> > 2010-10-04 22:26:13.777769 7f2e489e4710 -- 192.168.1.103:6800/2081 >>
>> >> > 192.168.1.101:0/15702 pipe(0x7f2e380008c0 sd=-1 pgs=1847 cs=1
>> >> > l=0).fault with nothing to send, going to standby
>> >> > ------------------------------------------------------------------------------
>> >> > Could you suggest how I should troubleshoot this issue? Or should I
>> >> > just restart the mds to recover it?
>> >>
>> >> The MDS killed itself because it was removed from the mdsmap.  The
>> >> monitor log will tell you why if you had logging turned up.  If not, you
>> >> might find some clue by looking at each mdsmap iteration.  If you do
>> >>
>> >>  $ ceph mds stat
>> >>
>> >> it will tell you the map epoch (e###).  You can then dump any map
>> >> iteration with
>> >>
>> >>  $ ceph mds dump 123 -o -
>> >>
>> >> Work backward a few iterations until you find which epoch removed that mds
>> >> instance.  The one prior to that might have some clue (maybe it was
>> >> laggy?)...
>> >
>> > Okay, looking at the maps on your cluster, it looks like there was a
>> > standby mds, and the live one was marked down.  Probably some intermittent
>> > network issue preventing it from sending the monitor beacon on time, and
>> > the monitor decided it was dead/unresponsive.  The standby cmds took over
>> > successfully.  The recovery looks like it took about 20 seconds.
>> >
>> > sage
>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
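
The procedure Sage describes in the quoted text (take the current epoch from `ceph mds stat`, then work backward through the maps until the instance is listed again) could be scripted roughly as follows. This is only a sketch: `dump_epoch` and `find_removal_epoch` are hypothetical helper names, it assumes the `ceph mds dump <epoch> -o -` form quoted above prints the map as text that includes each instance's address, and the plain substring match on the address is a simplification.

```python
import subprocess
from typing import Callable, Optional


def dump_epoch(epoch: int) -> str:
    """Return the text of one mdsmap epoch, using the command quoted in the thread."""
    result = subprocess.run(
        ["ceph", "mds", "dump", str(epoch), "-o", "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_removal_epoch(latest: int, addr: str,
                       dump: Callable[[int], str] = dump_epoch) -> Optional[int]:
    """Walk backward from `latest`; return the first epoch that no longer lists `addr`.

    Returns None if `addr` is still in the latest map or never appears at all.
    Assumption: an instance counts as "in the map" if its address occurs in the dump.
    """
    if addr in dump(latest):
        return None  # still present, so nothing was removed
    for epoch in range(latest - 1, 0, -1):
        if addr in dump(epoch):
            return epoch + 1  # present here, gone in the next epoch
    return None
```

For example, `find_removal_epoch(123, "192.168.1.103:6800/2081")` would use the address from the log above, with 123 standing in for whatever epoch `ceph mds stat` reports. The map just before the returned epoch is the one to read for clues such as a laggy flag.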
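
The re-exec idea Sage floats in this thread (respawn with the same arguments the original instance was started with, rather than exit) follows a common pattern: keep the original argument vector and replace the process image with it. Below is a minimal sketch of that pattern in Python rather than the daemon's own C++; `reexec_command` and `respawn` are hypothetical names, it assumes the process was started as a plain script (not with `-m` or `-c`), and it is not a description of how cmds does or will do this.

```python
import os
import sys
from typing import List, Tuple


def reexec_command(argv: List[str]) -> Tuple[str, List[str]]:
    """Build the (path, argv) pair needed to restart this script unchanged."""
    # sys.executable is the interpreter; argv[0] is the script path as first given.
    return sys.executable, [sys.executable] + list(argv)


def respawn() -> None:
    """Replace the current process image with a fresh copy of itself."""
    path, args = reexec_command(sys.argv)
    # Flush buffered output first, since exec discards anything still in memory.
    sys.stdout.flush()
    sys.stderr.flush()
    os.execv(path, args)  # does not return on success
```

Because exec replaces the whole process image, none of the old in-memory state survives, which matches the requirement in the thread that the old instance must die one way or another. The new copy starts clean with the original arguments.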