MDS Problems - Solved but reporting for benefit of others


 



Hi all,

We just had a bit of an outage with CephFS, centred on the MDSs. I managed to get everything up and running again after a bit of head
scratching and thought I would share what happened here.

Cause
I believe the MDSs, which were running as VMs, suffered when the hypervisor ran out of RAM and started swapping during hypervisor
maintenance. I know this is less than ideal and have put steps in place to prevent it happening again.

Symptoms
1. Noticed that both MDSs were down; the log files on both showed that they had crashed
2. After restarting the MDSs, their status kept flipping between replay and reconnect
3. Both MDSs would then crash again
4. Log files showed that they kept restarting after trying to reconnect clients
5. Clients were all kernel clients; one was running kernel 3.19 and the rest 4.8. I believe the problematic client was one of those running kernel 4.8
6. Ceph is 10.2.2

Resolution
After some serious head scratching and a little bit of panicking, the fact that the log files showed the restart always happened after
trying to reconnect the clients gave me the idea to try and kill the sessions on the MDS. I first reset all the clients and waited,
but this didn't seem to have any effect and I could still see the MDS trying to reconnect to the clients. I then decided to try and
kill the sessions from the MDS end, so I shut down the standby MDS (as they kept swapping the active role between them) and ran

ceph daemon mds.gp-ceph-mds1 session ls 

I then tried to kill the last session in the list

ceph daemon mds.gp-ceph-mds1 session evict <session id>

I had to keep hammering this command to get it at the right point, as the MDS was only responding for a fraction of a second.
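
For anyone who would rather not hammer the command by hand, the retrying could be scripted along these lines. This is only a sketch and not what I actually ran: retry_until_success is a made-up helper name, SESSION_ID is a placeholder for an id taken from the session ls output, and it assumes the ceph command exits non-zero whenever the MDS admin socket is not answering.

```shell
#!/bin/sh
# retry_until_success MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND repeatedly until it exits 0
# or MAX_ATTEMPTS attempts have been made.
retry_until_success() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # "$@" is the command plus its arguments, passed through unchanged.
        if "$@"; then
            echo "succeeded on attempt $attempt"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    echo "gave up after $max_attempts attempts" >&2
    return 1
}

# Example invocation (commented out so that sourcing this file runs nothing).
# SESSION_ID is a placeholder; the fractional delay assumes a sleep that
# accepts fractions of a second, such as GNU sleep.
# retry_until_success 200 0.1 ceph daemon mds.gp-ceph-mds1 session evict "$SESSION_ID"
```

Keeping the attempt limit finite means the loop gives up on its own if the MDS never responds, rather than spinning forever.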

Suddenly in my other window, where I was tailing the MDS log, I saw a whizz of new information which then stopped with the MDS
success message. So it seems something the MDS was trying to do whilst reconnecting was upsetting it. ceph -s updated to show the MDS
was now active. Rebooting the other MDS then brought it back as standby as well. Problem solved.

I have uploaded the two MDS logs here in case any CephFS devs are interested in taking a closer look.

http://app.sys-pro.co.uk/mds_logs.zip

Nick

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


