Re: [PATCH 2/3] mdsmap: fix mdsmap cluster available check based on laggy number

Jeff Layton <jlayton@xxxxxxxxxx> · Thu, 21 Nov 2019 12:30:17 -0500



On Wed, 2019-11-20 at 03:29 -0500, xiubli@xxxxxxxxxx wrote:
> From: Xiubo Li <xiubli@xxxxxxxxxx>
> 
> When max_mds > 1 in an MDS cluster, there are no standby MDSs, and
> all max_mds MDSs are in the up:active state, the death of one
> up:active MDS sets m->m_num_laggy in the kclient to 1. The mount
> then fails without considering the other healthy MDSs.
> 
> Treat the cluster as unavailable only when all of the MDSs in the
> cluster are laggy.
> 
> Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
> ---
>  fs/ceph/mdsmap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ceph/mdsmap.c b/fs/ceph/mdsmap.c
> index 471bac335fae..8b4f93e5b468 100644
> --- a/fs/ceph/mdsmap.c
> +++ b/fs/ceph/mdsmap.c
> @@ -396,7 +396,7 @@ bool ceph_mdsmap_is_cluster_available(struct ceph_mdsmap *m)
>  		return false;
>  	if (m->m_damaged)
>  		return false;
> -	if (m->m_num_laggy > 0)
> +	if (m->m_num_laggy == m->m_num_mds)
>  		return false;
>  	for (i = 0; i < m->m_num_mds; i++) {
>  		if (m->m_info[i].state == CEPH_MDS_STATE_ACTIVE)

Given that laggy servers are still expected to be "in" the cluster,
should we just eliminate this check altogether? It seems like we'd still
want to allow a mount to occur even if the cluster is lagging.
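
To make the three behaviors concrete, here is a rough userspace sketch, not
kernel code. The struct, enum and function names are hypothetical stand-ins
for struct ceph_mdsmap and ceph_mdsmap_is_cluster_available(), and it assumes
the rest of the function only requires at least one up:active rank:

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_MDS 8	/* hypothetical cap, only for this sketch */

/* Simplified stand-in for struct ceph_mdsmap (assumed layout, not the real one). */
struct sketch_mdsmap {
	bool damaged;			/* models m->m_damaged */
	int num_mds;			/* models m->m_num_mds */
	int num_laggy;			/* models m->m_num_laggy */
	bool active[SKETCH_MAX_MDS];	/* true if rank i is up:active */
};

enum laggy_policy {
	LAGGY_ANY_FAILS,	/* existing check: m_num_laggy > 0 */
	LAGGY_ALL_FAILS,	/* this patch: m_num_laggy == m_num_mds */
	LAGGY_IGNORED,		/* suggestion above: no laggy check at all */
};

bool sketch_cluster_available(const struct sketch_mdsmap *m,
			      enum laggy_policy policy)
{
	int i;

	assert(m->num_mds <= SKETCH_MAX_MDS && m->num_laggy <= m->num_mds);

	if (m->damaged)
		return false;
	if (policy == LAGGY_ANY_FAILS && m->num_laggy > 0)
		return false;
	if (policy == LAGGY_ALL_FAILS && m->num_laggy == m->num_mds)
		return false;

	/* Assumption: every policy still needs at least one up:active rank. */
	for (i = 0; i < m->num_mds; i++) {
		if (m->active[i])
			return true;
	}
	return false;
}
```

With two up:active ranks and one of them laggy, only LAGGY_ANY_FAILS refuses
the mount. With both ranks laggy but still up:active, LAGGY_ALL_FAILS refuses
as well and only LAGGY_IGNORED lets the mount proceed, which is the case the
question above is about.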
-- 
Jeff Layton <jlayton@xxxxxxxxxx>