On Thu, Apr 26, 2018 at 3:16 PM, Scottix <scottix@xxxxxxxxx> wrote:
> Updated to 12.2.5
>
> We are starting to test multi_mds cephfs and we are going through some
> failure scenarios in our test cluster.
>
> We are simulating a power failure to one machine and we are getting
> mixed results of what happens to the file system.
>
> This is the status of the mds once we simulate the power loss,
> considering there are no more standbys:
>
> mds: cephfs-2/2/2 up
> {0=CephDeploy100=up:active,1=TigoMDS100=up:active(laggy or crashed)}
>
> 1. It is a little unclear if it is laggy or really is down, using this
> line alone.

Of course -- the mons can't tell the difference!

> 2. The first time we lost total access to the ceph folder and i/o just
> blocked.

You must have standbys for high availability. This is in the docs.

> 3. One time we were still able to access the ceph folder and everything
> seemed to be running.

It depends(tm) on how the metadata is distributed and what locks are
held by each MDS. Standbys are not optional in any production cluster.

--
Patrick Donnelly
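
A minimal sketch of making sure a standby exists, assuming the
filesystem is named "cephfs" (as in the status line above), that
ceph-deploy is in use, and that a spare host (here called
"mds-standby", a placeholder name) is available for the extra daemon:

    # deploy one more MDS daemon to act as a standby
    ceph-deploy mds create mds-standby

    # ask the cluster to keep at least one standby available;
    # if the count drops below this, "ceph health" raises a warning
    ceph fs set cephfs standby_count_wanted 1

    # verify: the extra daemon should show up as a standby
    ceph fs status cephfs

With a standby in place, an active MDS that stops sending beacons to
the mons is replaced by the standby instead of leaving the rank stuck
in "laggy or crashed".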