Re: a question about laggy mds

Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx> · Wed, 23 Mar 2011 19:27:46 -0700

2011/3/23 huang jun <hjwsm1989@xxxxxxxxx>:
> Hi all,
> There are two mds in the ceph cluster, one is active and the other is
> standby. In my test, I found the mds0 was marked as laggy, and it
> was taken over by the standby soon. And it will take a long time for
> the standby to become active if there are a great many requests
> from the client. I want to know under what circumstances mds would be
> marked as laggy.

The MDS gets marked laggy if it goes too long without sending a
"beacon" to the monitors. This generally happens if the MDS gets
overloaded by client requests for some reason -- or if it simply
crashes. Your config looks okay, so either your MDS doesn't have the
resources it needs for the workload you're running, or the workload
breaks our default config/algorithms.
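
To make the beacon check concrete, here is a minimal sketch in Python.
It is a toy model, not Ceph's monitor code: the BeaconTracker class, its
method names, and the 15-second grace value are all assumptions chosen
for illustration. The only idea taken from the explanation above is that
an MDS is flagged laggy once too much time has passed since its last
beacon.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BeaconTracker:
    """Toy model of a monitor tracking MDS beacons (hypothetical, not Ceph code)."""

    grace: float = 15.0  # assumed grace period in seconds, illustrative only
    last_beacon: Dict[str, float] = field(default_factory=dict)

    def record_beacon(self, mds: str, now: float) -> None:
        # Remember when this MDS last checked in.
        self.last_beacon[mds] = now

    def laggy(self, now: float) -> List[str]:
        # An MDS is laggy if its last beacon is older than the grace period.
        return sorted(
            name for name, seen in self.last_beacon.items()
            if now - seen > self.grace
        )


tracker = BeaconTracker(grace=15.0)
tracker.record_beacon("mds0", now=0.0)
tracker.record_beacon("mds1", now=0.0)
tracker.record_beacon("mds1", now=12.0)  # mds1 keeps beaconing, mds0 has stalled
print(tracker.laggy(now=20.0))  # ['mds0']
```

In this model an overloaded MDS and a crashed MDS look identical: both
simply stop calling record_beacon, which is why either condition ends in
the same laggy marking.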

The amount of time it takes for a standby to take over is generally
determined by 3 things:
1) Time to declare an mds down (this is when it's marked laggy)
2) Time to replay the MDS journal
3) Time to handle client replay requests

Usually (1) takes much longer than (2) and (3) combined, so I'm
surprised that isn't the case for you... What does your workload look like?
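
As a rough way to reason about the three phases listed above, the
takeover time can be treated as their sum. The sketch below assumes the
phases run one after another, and every number in it is invented for
illustration; none of them are measurements from this cluster or Ceph
defaults. The function and key names are hypothetical.

```python
from typing import Dict, Tuple


def takeover_estimate(phases: Dict[str, float]) -> Tuple[float, str]:
    """Return (total seconds, name of the longest phase).

    Assumes the phases are strictly sequential, which is a simplification.
    """
    total = sum(phases.values())
    dominant = max(phases, key=phases.get)
    return total, dominant


# Invented numbers: a lightly loaded cluster, where detection dominates.
light = {"detect_down": 15.0, "journal_replay": 2.0, "client_replay": 1.0}

# Invented numbers: many outstanding client requests at failure time.
heavy = {"detect_down": 15.0, "journal_replay": 40.0, "client_replay": 90.0}

print(takeover_estimate(light))  # (18.0, 'detect_down')
print(takeover_estimate(heavy))  # (145.0, 'client_replay')
```

Measuring each phase separately on a real cluster would show which of
the two patterns applies, and therefore whether tuning detection time
would help at all.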
-Greg