Re: a question about laggy mds

Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx> · Thu, 24 Mar 2011 09:24:12 -0700



2011/3/23 Yuki <chengmao2010@xxxxxxxxx>:
> Hi,
> If there is only one mds in the ceph cluster, then after it is marked laggy it can sometimes clear the laggy flag itself. But if there are two mdses, the laggy one is soon taken over by the standby. Under what circumstances can a laggy mds clear the laggy flag?
An mds is marked laggy *by the monitor* if it goes too long without
delivering the monitor a beacon message. At that point, if there is an
available mds in standby or standby-replay the laggy mds will get
blacklisted and the standby will take over. If there are no available
standbys, the mds stays laggy, and can clear the flag itself by
sending the monitor a couple of messages.
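
To make that sequence concrete, here is a rough Python sketch of the
monitor-side decision. It is a toy model, not Ceph code: the names
ToyMonitor, MdsInfo and BEACON_GRACE are made up for illustration, the
grace value is a placeholder, and the flag is cleared after a single
beacon rather than "a couple" to keep it short.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical grace period in seconds; a placeholder, not read from any Ceph config.
BEACON_GRACE = 15.0


@dataclass
class MdsInfo:
    name: str
    last_beacon: float          # time the monitor last heard from this mds
    laggy: bool = False
    blacklisted: bool = False


@dataclass
class ToyMonitor:
    active: MdsInfo
    standbys: List[MdsInfo] = field(default_factory=list)

    def on_beacon(self, name: str, now: float) -> None:
        """Record a beacon; an mds that is still known clears its laggy flag."""
        for mds in [self.active] + self.standbys:
            if mds.name == name and not mds.blacklisted:
                mds.last_beacon = now
                mds.laggy = False

    def tick(self, now: float) -> Optional[MdsInfo]:
        """Check the active mds; return the mds that was replaced, if any."""
        if now - self.active.last_beacon <= BEACON_GRACE:
            return None                      # beacon is recent enough
        self.active.laggy = True             # marked laggy by the monitor
        # A standby only counts as available if its own beacon is recent.
        available = [s for s in self.standbys
                     if now - s.last_beacon <= BEACON_GRACE]
        if not available:
            return None                      # stays laggy, may recover later
        failed = self.active
        failed.blacklisted = True            # the laggy mds is shut out
        self.active = available[0]           # the standby takes over
        self.standbys = [s for s in self.standbys if s is not self.active]
        return failed
```

With no standby, tick() only sets the laggy flag and a later on_beacon()
clears it. With a standby present, the same tick() blacklists the old
mds, and its later beacons are ignored.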

> Why does the mds send a "beacon" to the monitors? Does it just report its state to the mon? The mds is marked laggy by itself, not by the mon, is that right?
The beacon message contains a very minimal amount of state, and is
mostly just a heartbeat to let the monitor know that the MDS is still
alive.
As I said above, an mds is marked laggy by the monitors. :)
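
For the mds side, a beacon can be pictured as a tiny message sent on a
timer. The sketch below is an assumption-laden illustration in Python,
not the real message format: the Beacon fields, the BeaconSender class
and the BEACON_INTERVAL value are all hypothetical.

```python
import itertools
from dataclasses import dataclass
from typing import Optional

# Hypothetical send interval in seconds; a placeholder value for illustration.
BEACON_INTERVAL = 4.0


@dataclass(frozen=True)
class Beacon:
    """A deliberately small heartbeat: who I am, what state I'm in, a counter."""
    name: str
    state: str
    seq: int


class BeaconSender:
    def __init__(self, name: str) -> None:
        self.name = name
        self._seq = itertools.count(1)       # increasing sequence numbers
        self.last_sent: Optional[float] = None

    def due(self, now: float) -> bool:
        """True if no beacon has been sent yet or the interval has elapsed."""
        return self.last_sent is None or now - self.last_sent >= BEACON_INTERVAL

    def make_beacon(self, state: str, now: float) -> Beacon:
        """Build the next beacon and remember when it was sent."""
        self.last_sent = now
        return Beacon(self.name, state, next(self._seq))
```

If the mds is too busy to call make_beacon() on schedule, the monitor
simply stops hearing from it, which is what leads to the laggy mark.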
-Greg
PS: Please send questions like these to the list, not to individuals.
You will get better answers more quickly, and the contents of the
emails are archived and indexed by search engines so other people can
get their answer by google instead of email! :)

> Thanks.
>
> ----- Original Message -----
> From: "Gregory Farnum" <gregory.farnum@xxxxxxxxxxxxx>
> To: "huang jun" <hjwsm1989@xxxxxxxxx>
> Cc: <ceph-devel@xxxxxxxxxxxxxxx>
> Sent: Thursday, March 24, 2011 10:27 AM
> Subject: Re: a question about laggy mds
>
>
>> 2011/3/23 huang jun <hjwsm1989@xxxxxxxxx>:
>>> Hi all,
>>> There are two mds in the ceph cluster, one is active and the other is
>>> standby. In my test, I found that mds0 was marked as laggy, and it
>>> was soon taken over by the standby. It takes a long time for the
>>> standby to become active if there are a great many requests from
>>> the client. I want to know under what circumstances an mds would be
>>> marked as laggy.
>>
>> The MDS gets marked laggy if it goes too long without sending a
>> "beacon" to the monitors. This generally happens if the MDS gets
>> overloaded by client requests for some reason -- or if it simply
>> crashes. Your config looks okay so either your MDS doesn't have the
>> resources it needs for the workload you're using, or the workload
>> breaks our default config/algorithms.
>>
>> The amount of time it takes for a standby to take over is generally
>> determined by 3 things:
>> 1) Time to declare an mds down (this is when it's marked laggy)
>> 2) Time to replay the MDS journal
>> 3) Time to handle client replay requests
>>
>> Usually (2) and (3) are dominated by (1), and I'm surprised this isn't
>> the case for you... What's your workload look like?
>> -Greg
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html