On Thu, Apr 26, 2018 at 4:40 PM, Scottix <scottix@xxxxxxxxx> wrote:
>> Of course -- the mons can't tell the difference!
> That is really unfortunate; it would be nice to know if the file system
> has been degraded and to what degree.

If a rank is laggy/crashed, the file system as a whole is generally
unavailable. The span between a partial outage and a full one is small and
not worth quantifying.

>> You must have standbys for high availability. This is in the docs.
> Ok, but what if your standby goes down and then an active MDS goes down?
> This could happen in the real world and is a valid error scenario.
> Also, there is a period before the standby becomes active; what happens
> in between?

The standby MDS goes through a series of states in which it recovers the
lost state and its connections with clients. Finally, it goes active.

>> It depends(tm) on how the metadata is distributed and what locks are
>> held by each MDS.
> You're saying that depending on which MDS holds a lock on a resource, it
> will block that particular POSIX operation? Can you clarify a little bit?
>
>> Standbys are not optional in any production cluster.
> Of course, in production I would hope people have standbys, but in theory
> there is no enforcement in Ceph for this other than a warning. So when
> you say "not optional", that is not exactly true: it will still run.

It's self-defeating to expect CephFS to enforce having standbys --
presumably by throwing an error or becoming unavailable -- when the
standbys exist to make the system available. There's nothing to enforce.
A warning is sufficient to tell the operator that (a) they didn't
configure any standbys, or (b) MDS daemon processes/boxes are going away
and not coming back as standbys (i.e. the pool of MDS daemons is
decreasing with each failover).

-- 
Patrick Donnelly
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
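
The standby warning and MDS-state behavior discussed in the thread can be
inspected from the command line; a minimal sketch, assuming a
Luminous-or-later cluster and an example file system named `cephfs`:

```shell
# Show each rank's state (up:active, up:replay, up:reconnect, ...) and
# the current pool of standby daemons. Requires a running cluster.
ceph fs status

# Ask the mons to raise a health warning (MDS_INSUFFICIENT_STANDBY) when
# fewer than one standby is available for this file system.
ceph fs set cephfs standby_count_wanted 1

# Surface any resulting warning, e.g. "insufficient standby MDS daemons
# available".
ceph health detail
```

With `standby_count_wanted` set, the cluster still runs without standbys,
exactly as described above -- the operator just gets the warning.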