On Mon, Jan 30, 2017 at 7:09 AM, Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
>
> On 01/26/2017 03:34 PM, John Spray wrote:
>>
>> On Thu, Jan 26, 2017 at 8:18 AM, Burkhard Linke
>> <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> We are running two MDS servers in an active/standby-replay setup.
>>> Recently we had to disconnect the active MDS server, and failover to
>>> the standby worked as expected.
>>>
>>> The filesystem currently contains over 5 million files, so reading
>>> all the metadata information from the data pool took too long, since
>>> the information was not available in the OSD page caches. The MDS was
>>> timed out by the mons, and a failover switch to the former active MDS
>>> (which was available as a standby again) happened. This MDS in turn
>>> had to read the metadata, again running into a timeout, failover,
>>> etc. I resolved the situation by disabling one of the MDS daemons,
>>> which kept the mons from failing the sole remaining MDS.
>>
>> The MDS does not re-read every inode on startup -- rather, it replays
>> its journal (the overall number of files in your system does not
>> factor into this).
>>
>>> So given a large filesystem, how do I prevent failover flapping
>>> between MDS instances that are in the rejoin state and reading the
>>> inode information?
>>
>> The monitor's decision to fail an unresponsive MDS is based on the MDS
>> not sending a beacon to the mon -- there is no limit on how long an
>> MDS is allowed to stay in a given state (such as rejoin).
>>
>> So there are two things to investigate here:
>>  * Why is the MDS taking so long to start?
>>  * Why is the MDS failing to send beacons to the monitor while it is
>>    in whatever process is taking it so long?
>
> Under normal operation our system has about 4.5-4.9 million active
> caps. Most of them (~4 million) are associated with the machine running
> the nightly backups.
>
> I assume that during the rejoin phase, the MDS is renewing the clients'
> caps. We see massive amounts of small I/O on the data pool (up to
> 30,000-40,000 IOPS) during the rejoin phase. Does the MDS need to
> access the inode information to renew a cap? This would explain the
> high number of IOPS and why the rejoin phase can take up to 20 minutes.

Ah, I see. You've identified the issue -- the client is informing the
MDS about which inodes it has caps on, and the MDS is responding by
loading those inodes; in order to dereference them it goes via the data
pool to read the backtrace on each of the inode objects.

This is not great behaviour from the MDS: doing O(files with caps) IOs,
especially to the data pool, is not something we want to be doing during
failovers.

Things to try to mitigate this with the current code:
 * Using standby-replay daemons (if you're not already), so that the
   standby has a better chance of already having the inodes in cache,
   avoiding the need to load them.
 * Increasing the MDS journal size ("mds log max segments") so that the
   MDS will tend to keep a longer journal and have a better chance of
   still having the inodes in the journal at the time the failover
   happens.
 * Decreasing "mds cache size" to limit the number of caps that can be
   out there at any one time.

I'll respond separately to ceph-devel about how we might change the code
to improve this case.

John

> Not sure about the second question, since the IOPS should not prevent
> beacons from reaching the monitors.
> We will have to move the MDS servers to different racks during this
> week. I'll try to bump up the debug level before.
>
> Regards,
> Burkhard

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
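[Editor's note: as a starting point for the tunables John mentions above,
here is a minimal ceph.conf sketch. Only "mds log max segments" and
"mds cache size" come from the thread itself; "mds standby replay" is
assumed as the usual pre-Luminous way to enable standby-replay, and the
numeric values are purely illustrative, not recommendations -- tune them
for your own cluster and release.]

    [mds]
        # Let a non-active daemon follow the active MDS's journal so its
        # cache is already warm when a failover happens.
        mds standby replay = true

        # Keep more journal segments than the default (30) so recently
        # used inodes are more likely to still be in the journal at
        # failover time. 200 is an illustrative value only.
        mds log max segments = 200

        # Bounds the number of inodes (and therefore caps) the MDS keeps
        # in memory; pick something lower than your current setting if
        # the goal is to limit outstanding caps (default is 100000).
        mds cache size = 1000000

Depending on the release, the journal and cache settings can usually also
be changed at runtime (e.g. via "ceph tell mds.<id> injectargs"), while
standby-replay only takes effect when the standby daemon is (re)started.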