Re: Multi-MDS Failover

Hi Scott,

Multi-MDS just assigns different parts of the namespace to different
"ranks". Each rank (0, 1, 2, ...) is handled by one of the active
MDSs. (You can query which parts of the namespace are assigned to
each rank using the jq tricks in [1].) If a rank is down and there are
no more standbys, then you need to bring up a new MDS to handle that
down rank. In the meantime, IO to that part of the namespace will be
blocked.
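
To make the "part of the namespace" point concrete, here is a toy Python
model. It is not Ceph code: the paths, the rank assignments and the
function names are all made up for illustration. It assumes a static
subtree-to-rank map, like the one you get with pinning, and reports which
paths would see blocked metadata IO while a rank has no MDS.

```python
# Toy model only. Real subtree assignment is dynamic unless directories are
# pinned, and all paths and rank numbers below are hypothetical.
import posixpath

SUBTREE_TO_RANK = {
    "/": 0,
    "/home": 1,
    "/home/scratch": 0,
}


def rank_for_path(path, subtree_to_rank=SUBTREE_TO_RANK):
    """Return the rank owning `path`: the nearest ancestor subtree wins."""
    # Normalize to exactly one leading slash so the walk always ends at "/".
    current = "/" + posixpath.normpath(path).lstrip("/")
    while True:
        if current in subtree_to_rank:
            return subtree_to_rank[current]
        if current == "/":
            raise KeyError(f"no subtree covers {path}")
        current = posixpath.dirname(current)


def is_blocked(path, down_ranks, subtree_to_rank=SUBTREE_TO_RANK):
    """True if the rank owning `path` is currently down."""
    return rank_for_path(path, subtree_to_rank) in down_ranks


if __name__ == "__main__":
    for p in ("/home/alice/file", "/home/scratch/tmp", "/var/log"):
        print(p, "blocked" if is_blocked(p, down_ranks={1}) else "ok")
```

Treat this as a lower bound on what blocks. In a real cluster an
operation can also stall when it needs a lock or an ancestor directory
held by the down rank, so more of the tree may be affected than this
model suggests.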

To handle these failures, you need to configure sufficient standby
MDSs to handle the failure scenarios you foresee in your environment.
A strictly "standby" MDS can take over from *any* of the failed ranks,
and you can have several "standby" MDSs to cover multiple failures. So
just run 2 or 3 standbys if you want to be on the safe side.

You can also configure "standby-for-rank" MDSs -- that is, a given
standby MDS can watch a specific rank and take over if the MDS holding
that rank fails. Those standby-for-rank MDSs can even be "hot"
standbys to speed up the failover process.
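
The difference between a generic standby and a standby-for-rank can be
sketched with another toy Python model. Again, this is not Ceph code and
the class and function names are hypothetical. It assumes that a
standby-for-rank daemon only ever replaces its own rank and that a
dedicated standby is preferred over a generic one; that is my reading of
the behavior, so please treat it as an assumption rather than a
description of the monitor logic.

```python
# Toy model of standby selection; names and selection order are assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Standby:
    name: str
    for_rank: Optional[int] = None  # None means a generic standby


def pick_standby(failed_rank, standbys):
    """Pick a replacement for failed_rank, or None if the rank stays down."""
    # Prefer a standby dedicated to this rank.
    for s in standbys:
        if s.for_rank == failed_rank:
            return s
    # Otherwise fall back to any generic standby.
    for s in standbys:
        if s.for_rank is None:
            return s
    return None


def fail_over(failed_ranks, standbys):
    """Assign standbys to failed ranks in order; returns {rank: name or None}."""
    pool = list(standbys)
    result = {}
    for rank in failed_ranks:
        chosen = pick_standby(rank, pool)
        if chosen is not None:
            pool.remove(chosen)  # a standby can only take over one rank
        result[rank] = chosen.name if chosen else None
    return result


if __name__ == "__main__":
    pool = [Standby("mds-c"), Standby("mds-d", for_rank=1)]
    print(fail_over([0, 1, 2], pool))
```

With one generic standby and one standby for rank 1, a third failed rank
has nobody left to take it over and stays down, which is the situation
where part of the namespace blocks until you bring up another MDS.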

An active MDS for a given rank does not act as a standby for the other
ranks. I'm not sure if it *could* following some code changes, but
anyway that's just not how it works today.

Does that clarify things?

Cheers, Dan

[1] https://ceph.com/community/new-luminous-cephfs-subtree-pinning/


On Fri, Apr 27, 2018 at 4:04 AM, Scottix <scottix@xxxxxxxxx> wrote:
> Ok, let me try to explain this better; we are going back and forth and
> it's not going anywhere. I'll just be as genuine as I can and explain the
> issue.
>
> What we are testing is a critical failure scenario and actually more of a
> real world scenario. Basically just what happens when it is 1AM and the shit
> hits the fan, half of your servers are down and only 1 of the 3 MDS boxes
> is still alive.
> There is one very important thing that happens with CephFS when the
> single active MDS server fails: it is guaranteed that 100% of IO is
> blocked. No split-brain, no corrupted data, 100% guaranteed ever since we
> started using CephFS.
>
> Now with multi_mds, I understand this changes the logic and I understand how
> difficult this problem is; trust me, I would not be able to
> tackle this. Basically I need to answer the question: what happens when 1 of
> the 2 active MDSs fails with no standbys ready to come save them?
> What I have tested is not the same as a single active MDS; this absolutely
> changes the logic of what happens and how we troubleshoot. The CephFS is
> still alive and it does allow operations and does allow resources to go
> through. How, why and what is affected are very relevant questions if this
> is what the failure looks like, since it is not 100% blocking.
>
> This is the problem, I have programs writing a massive amount of data and I
> don't want it corrupted or lost. I need to know what happens and I need to
> have guarantees.
>
> Best
>
>
> On Thu, Apr 26, 2018 at 5:03 PM Patrick Donnelly <pdonnell@xxxxxxxxxx>
> wrote:
>>
>> On Thu, Apr 26, 2018 at 4:40 PM, Scottix <scottix@xxxxxxxxx> wrote:
>> >> Of course -- the mons can't tell the difference!
>> > That is really unfortunate, it would be nice to know if the filesystem
>> > has been degraded and to what degree.
>>
>> If a rank is laggy/crashed, the file system as a whole is generally
>> unavailable. The span between partial outage and full is small and not
>> worth quantifying.
>>
>> >> You must have standbys for high availability. This is the docs.
>> > Ok, but what if your standby goes down and a master goes down? This
>> > could happen in the real world and is a valid error scenario.
>> > Also, there is a period before the standby becomes active. What happens
>> > in that time?
>>
>> The standby MDS goes through a series of states where it recovers the
>> lost state and connections with clients. Finally, it goes active.
>>
>> >> It depends(tm) on how the metadata is distributed and what locks are
>> >> held by each MDS.
>> > You're saying that, depending on which MDS had a lock on a resource, it
>> > will block that particular POSIX operation? Can you clarify a little bit?
>> >
>> >> Standbys are not optional in any production cluster.
>> > Of course in production I would hope people have standbys, but in theory
>> > there is no enforcement in Ceph for this other than a warning. So when
>> > you say "not optional", that is not exactly true; it will still run.
>>
>> It's self-defeating to expect CephFS to enforce having standbys --
>> presumably by throwing an error or becoming unavailable -- when the
>> standbys exist to make the system available.
>>
>> There's nothing to enforce. A warning is sufficient for the operator
>> that (a) they didn't configure any standbys or (b) MDS daemon
>> processes/boxes are going away and not coming back as standbys (i.e.
>> the pool of MDS daemons is decreasing with each failover)
>>
>> --
>> Patrick Donnelly
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


