On Wed, Nov 2, 2016 at 2:49 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> A bit more digging: the original crash appears to be similar (but not exactly the same) as this tracker report:
>
> http://tracker.ceph.com/issues/16983
>
> I can see that this was fixed in 10.2.3, so I will probably look to upgrade.
>
> If the logs make sense to anybody with a bit more knowledge, I would be interested to know whether that bug is related or if I have stumbled on something new.

Yep, from what's present it definitely looks like that. Good searching. :)
-Greg

>
> Nick
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk
>> Sent: 02 November 2016 17:58
>> To: 'Ceph Users' <ceph-users@xxxxxxxxxxxxxx>
>> Subject: MDS Problems - Solved but reporting for benefit of others
>>
>> Hi all,
>>
>> We had a bit of an outage with CephFS around the MDSs. I managed to get everything up and running again after a bit of head scratching and thought I would share here what happened.
>>
>> Cause
>> I believe the MDSs, which were running as VMs, suffered when the hypervisor ran out of RAM and started swapping during hypervisor maintenance. I know this is less than ideal and have put steps in place to prevent it happening again.
>>
>> Symptoms
>> 1. Noticed that both MDSs were down; the log files on both showed that they had crashed.
>> 2. After restarting the MDSs, their status kept flipping between replay and reconnect.
>> 3. Every now and again both MDSs would crash again.
>> 4. The log files showed they seemed to keep restarting after trying to reconnect clients.
>> 5. The clients were all kernel clients; one was running 3.19 and the rest 4.8. I believe the problematic client was one of the ones running kernel 4.8.
>> 6. Ceph is 10.2.2.
>>
>> Resolution
>> After some serious head scratching and a little bit of panicking, the fact that the log files showed the restart always happened after trying to reconnect the clients gave me the idea to try killing the sessions on the MDS. I first reset all the clients and waited, but this didn't seem to have any effect and I could still see the MDS trying to reconnect to the clients. I then decided to kill the sessions from the MDS end, so I shut down the standby MDS (as they kept flipping active roles) and ran:
>>
>> ceph daemon mds.gp-ceph-mds1 session ls
>>
>> I then tried to kill the last session in the list:
>>
>> ceph daemon mds.gp-ceph-mds1 session evict <session id>
>>
>> I had to keep hammering this command to get it in at the right moment, as the MDS was only responding for a fraction of a second at a time.
>>
>> Suddenly, in my other window where I had a tail of the MDS log, I saw a whizz of new information, stopping with the MDS success message. So it seems something the MDS was trying to do whilst reconnecting was upsetting it. ceph -s then updated to show the MDS was active. Rebooting the other MDS made it standby as well. Problem solved.
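>>
>> A minimal sketch of that retry loop, assuming jq is available to pull the id of the last session out of the session ls JSON (the field name may differ between Ceph versions, so check your own output) and substituting your own MDS daemon name for mds.gp-ceph-mds1:
>>
>> # Grab the id of the last session reported by the MDS admin socket.
>> # If the MDS is barely responsive, this call may need retrying as well.
>> SESSION_ID=$(ceph daemon mds.gp-ceph-mds1 session ls | jq '.[-1].id')
>>
>> # Keep retrying the evict until the admin socket accepts it, since the
>> # MDS was only answering for a fraction of a second at a time.
>> until ceph daemon mds.gp-ceph-mds1 session evict "$SESSION_ID"; do
>>     sleep 0.1
>> done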
>>
>> I have uploaded the 2 MDS logs here if any CephFS devs are interested in taking a closer look:
>>
>> http://app.sys-pro.co.uk/mds_logs.zip
>>
>> Nick

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com