Re: MDS Problems - Solved but reporting for benefit of others

> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
> Sent: 08 November 2016 22:55
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  MDS Problems - Solved but reporting for benefit of others
> 
> On Wed, Nov 2, 2016 at 2:49 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > A bit more digging: the original crash appears to be similar to (but
> > not exactly the same as) this tracker report:
> >
> > http://tracker.ceph.com/issues/16983
> >
> > I can see that this was fixed in 10.2.3, so I will probably look to upgrade.
> >
> > If the logs make sense to anybody with a bit more knowledge I would be
> > interested if that bug is related or if I have stumbled on something new.
> 
> Yep, from what's present it definitely looks like that. Good searching. :) -Greg

Hi Greg,

Not sure if you saw my later post, but I think that bug, although present, was not the complete cause of the looping replay/reconnect process. I hit the same looping problem when I restarted the MDS's after the 10.2.3 upgrade, and had to force a hard reboot of all the clients to allow the MDS to settle. Another Ceph user who is experiencing similar issues has contacted me; he will be doing another restart this week to see if it happens again. I will also try another reboot within the next week. Hopefully between us we can capture some more detailed logging of what is happening.
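
For the record, this is roughly the logging I intend to turn up on the active MDS before the next restart (a sketch only; the daemon name is from my cluster, and the levels are just what I plan to try first):

ceph daemon mds.gp-ceph-mds1 config set debug_mds 20
ceph daemon mds.gp-ceph-mds1 config set debug_ms 1

The equivalent debug_mds/debug_ms settings under [mds] in ceph.conf would do the same thing across restarts.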

> 
> >
> > Nick
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >> Of Nick Fisk
> >> Sent: 02 November 2016 17:58
> >> To: 'Ceph Users' <ceph-users@xxxxxxxxxxxxxx>
> >> Subject:  MDS Problems - Solved but reporting for benefit
> >> of others
> >>
> >> Hi all,
> >>
> >> We've just had a bit of an outage with CephFS around the MDS's. I
> >> managed to get everything up and running again after a bit of head
> >> scratching and thought I would share here what happened.
> >>
> >> Cause
> >> I believe the MDS's, which were running as VMs, suffered when the
> >> hypervisor ran out of RAM and started swapping during hypervisor
> >> maintenance. I know this is less than ideal and have put steps in
> >> place to prevent it happening again.
> >>
> >> Symptoms
> >> 1. Noticed that both MDS's were down; the log files on both showed
> >>    that they had crashed.
> >> 2. After restarting the MDS's, their status kept flipping between
> >>    replay and reconnect (I was watching this as shown below).
> >> 3. Now and again both MDS's would crash again.
> >> 4. The log files showed they seemed to keep restarting after trying
> >>    to reconnect clients.
> >> 5. The clients were all kernel clients; one was on 3.19 and the rest
> >>    on 4.8. I believe the problematic client was one of the ones
> >>    running kernel 4.8.
> >> 6. Ceph is 10.2.2.
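> >>
> >> The flipping in point 2 I was watching with roughly the following in
> >> a spare window (just a sketch; plain 'ceph -s' or 'ceph mds stat'
> >> shows the same thing):
> >>
> >> # refresh the MDS map summary every second
> >> watch -n 1 ceph mds stat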
> >>
> >> Resolution
> >> After some serious head scratching and a little bit of panicking, the
> >> fact that the log files showed the restart always happened after
> >> trying to reconnect the clients gave me the idea to try to kill the
> >> sessions on the MDS. I first reset all the clients and waited, but
> >> this didn't seem to have any effect and I could still see the MDS
> >> trying to reconnect to the clients. I then decided to try to kill the
> >> sessions from the MDS end, so I shut down the standby MDS (as they
> >> kept flipping active roles) and ran:
> >>
> >> ceph daemon mds.gp-ceph-mds1 session ls
> >>
> >> I then tried to kill the last session in the list:
> >>
> >> ceph daemon mds.gp-ceph-mds1 session evict <session id>
> >>
> >> I had to keep hammering this command to hit it at the right moment,
> >> as the MDS was only responding for a fraction of a second at a time.
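> >>
> >> For anyone hitting the same thing, the hammering boiled down to
> >> roughly this (a rough sketch only; substitute your own MDS name and
> >> the session id from 'session ls'):
> >>
> >> # keep retrying the evict until the admin socket answers during the
> >> # brief window the MDS is responsive
> >> until ceph daemon mds.gp-ceph-mds1 session evict <session id>; do
> >>     sleep 0.2
> >> done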
> >>
> >> Suddenly, in my other window where I had a tail of the MDS log, I saw
> >> a whizz of new information, which then stopped with the MDS success
> >> message. So it seems something the MDS was trying to do whilst
> >> reconnecting was upsetting it. 'ceph -s' updated to show that the MDS
> >> was now active. Rebooting the other MDS then brought it back as a
> >> standby as well. Problem solved.
> >>
> >> I have uploaded the two MDS logs here, if any CephFS devs are
> >> interested in taking a closer look:
> >>
> >> http://app.sys-pro.co.uk/mds_logs.zip
> >>
> >> Nick
> >>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


