Multi-MDS Failover

We have updated the cluster to Ceph 12.2.5.

We are starting to test multi-MDS CephFS and are working through some failure scenarios in our test cluster.

We are simulating a power failure on one machine, and we are getting mixed results as to what happens to the file system.

This is the MDS status once we simulate the power loss; note that there are no standbys left.

mds: cephfs-2/2/2 up  {0=CephDeploy100=up:active,1=TigoMDS100=up:active(laggy or crashed)}
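
For reference, the commands below give a bit more detail than that one-line summary (the file system name "cephfs" is taken from the status line above; exact output will vary):

    # per-rank state, client counts, and any standby daemons
    ceph fs status cephfs

    # spell out the current health warnings in full
    ceph health detail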

1. From this line alone it is a little unclear whether the MDS is merely laggy or really down.
2. The first time, we lost all access to the CephFS mount and I/O simply blocked.
3. Another time, we were still able to access the CephFS mount and everything seemed to keep running.
4. Another time, we had a script creating a bunch of files (a minimal, illustrative version is sketched just after this list); we simulated the crash, then listed the directory and it showed 0 files, when it should have shown lots of files.
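
A minimal, illustrative stand-in for that script (the mount point and file count are made up):

    # keep creating small files on the CephFS mount while we pull the power on one MDS host
    for i in $(seq 1 100000); do touch /mnt/cephfs/failover-test/file_$i; done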

We could go into the details of each of those, but really I am trying to understand Ceph's logic for dealing with a crashed MDS in a multi-MDS file system: is the rank marked degraded, or what exactly is going on?

It just seems a little unclear what is going to happen.
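
For example, I assume the "laggy or crashed" marking is tied to the MDS beacon timeout (as I understand it, the MDS beacons to the mons every mds_beacon_interval seconds and is flagged laggy after mds_beacon_grace seconds of silence), but I have not found that spelled out. To see what we are running with, on the host of the surviving rank 0 daemon:

    # print the beacon timing options this MDS daemon is currently using
    ceph daemon mds.CephDeploy100 config get mds_beacon_interval
    ceph daemon mds.CephDeploy100 config get mds_beacon_grace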

The good news is that once the machine comes back online, everything is as it should be.
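
If the answer is simply "always keep a standby so the rank can fail over", this is roughly what we would add to the test cluster. The extra host name below is made up, and ceph-deploy is only a guess based on the host names above:

    # bring up one more MDS daemon to act as a standby (host name is hypothetical)
    ceph-deploy mds create mds-standby-host

    # have Ceph warn in health output whenever fewer than one standby is available
    ceph fs set cephfs standby_count_wanted 1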

Thanks
