On Mon, Jan 30, 2017 at 6:30 AM, John Spray <jspray@xxxxxxxxxx> wrote:
> This case (see forwarded) is showing that our current rejoin code is
> handling situations with many capabilities quite badly -- I think we
> should try to improve this soon.
>
> One thought I have is to just throttle the number of open_inos that we
> do, so that we allow the cache to get populated with the already-hit
> dirfrags before trying to load more backtraces. I created a ticket for
> that here: http://tracker.ceph.com/issues/18730 (should be pretty
> simple and doable for Luminous). That would help in cases where many
> of the affected inodes were in the same directory (which I expect is
> all real workloads).

My concern here is that if the caps are that scattered, a straight
throttle like that might slow us down even more, as we read in
directories for a cap and then throw them out. :/

> There are probably other, bigger changes we could make for this case,
> such as using the path passed in cap_reconnect_t to be smarter, or
> even adding a metadata pool structure that would provide super-fast
> lookup of backtraces for the N most recently touched inodes -- not
> saying we necessarily want to go that far!

I don't think we want to be doing durable storage for something like
that any more than we already do. I'm a little surprised this isn't
handled by journaled open inodes -- are we simply dropping some of them
after a long enough period without activity, or are we only journaling
inode numbers?

The possibility of using the paths to try to aggregate caps into
directories makes a lot more sense to me. We ought to be able to set up
lists of caps as waiters on a directory being read and then attempt to
process them as stuff comes in, or else put them into a proper lookup?
-Greg

> John
>
> ---------- Forwarded message ----------
> From: Burkhard Linke <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
> Date: Mon, Jan 30, 2017 at 7:09 AM
> Subject: Re: [ceph-users] MDS flapping: how to increase MDS timeouts?
> To: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
>
> Hi,
>
> On 01/26/2017 03:34 PM, John Spray wrote:
>>
>> On Thu, Jan 26, 2017 at 8:18 AM, Burkhard Linke
>> <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> We are running two MDS servers in an active/standby-replay setup.
>>> Recently we had to disconnect the active MDS server, and failover to
>>> the standby worked as expected.
>>>
>>> The filesystem currently contains over 5 million files, so reading all
>>> the metadata information from the data pool took too long, since the
>>> information was not available in the OSD page caches. The MDS was
>>> timed out by the mons, and a failover switch to the former active MDS
>>> (which was available as a standby again) happened. This MDS in turn
>>> had to read the metadata, again running into a timeout, failover, etc.
>>> I resolved the situation by disabling one of the MDS daemons, which
>>> kept the mons from failing the now solely available MDS.
>>
>> The MDS does not re-read every inode on startup -- rather, it replays
>> its journal (the overall number of files in your system does not
>> factor into this).
>>
>>> So given a large filesystem, how do I prevent failover flapping
>>> between MDS instances that are in the rejoin state and reading the
>>> inode information?
>>
>> The monitor's decision to fail an unresponsive MDS is based on the MDS
>> not sending a beacon to the mon -- there is no limit on how long an
>> MDS is allowed to stay in a given state (such as rejoin).
>>
>> So there are two things to investigate here:
>> * Why is the MDS taking so long to start?
>> * Why is the MDS failing to send beacons to the monitor while it is
>>   in whatever process is taking it so long?
>
> Under normal operation our system has about 4.5-4.9 million active
> caps. Most of them (~4 million) are associated with the machine
> running the nightly backups.
>
> I assume that during the rejoin phase, the MDS is renewing the
> clients' caps. We see a massive amount of small I/O on the data pool
> (up to 30,000-40,000 IOPS) during the rejoin phase. Does the MDS need
> to access the inode information to renew a cap? This would explain the
> high number of IOPS and why the rejoin phase can take up to 20
> minutes.
>
> I'm not sure about the second question, since the IOPS alone should
> not prevent beacons from reaching the monitors. We will have to move
> the MDS servers to different racks this week; I'll try to bump up the
> debug level beforehand.
>
> Regards,
> Burkhard
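
A rough sanity check on the numbers above: ~4.5 million caps with one
backtrace read each at the observed 30,000-40,000 IOPS is only a couple
of minutes of pure reads, so a 20-minute rejoin suggests several RADOS
operations per inode and/or limited parallelism. That is why bounding
the number of in-flight open_ino lookups, as John proposes, should
help. A minimal sketch of such a throttle in plain C++ (not actual Ceph
code; OpenInoThrottle and its callbacks are hypothetical names):

// Hypothetical sketch, not Ceph code: cap the number of backtrace
// lookups in flight so that already-fetched dirfrags get a chance to
// satisfy later reconnects before more random reads are issued.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <iostream>

using inodeno_t = std::uint64_t;

class OpenInoThrottle {
  struct Pending {
    inodeno_t ino;
    std::function<void(inodeno_t)> fetch;
  };

public:
  explicit OpenInoThrottle(std::size_t max_in_flight)
      : max_in_flight_(max_in_flight) {}

  // Queue a backtrace lookup; start it immediately if under the limit.
  void submit(inodeno_t ino, std::function<void(inodeno_t)> fetch) {
    if (in_flight_ < max_in_flight_) {
      ++in_flight_;
      fetch(ino);  // in a real MDS this would issue the RADOS read
    } else {
      waiting_.push_back({ino, std::move(fetch)});
    }
  }

  // Called when a lookup completes; start the next queued one, if any.
  void on_complete() {
    --in_flight_;
    if (!waiting_.empty()) {
      Pending next = std::move(waiting_.front());
      waiting_.pop_front();
      ++in_flight_;
      next.fetch(next.ino);
    }
  }

private:
  std::size_t max_in_flight_;
  std::size_t in_flight_ = 0;
  std::deque<Pending> waiting_;
};

int main() {
  OpenInoThrottle throttle(2);  // allow only two lookups at a time
  for (inodeno_t ino = 1; ino <= 5; ++ino) {
    throttle.submit(ino, [](inodeno_t i) {
      std::cout << "fetching backtrace for inode " << i << "\n";
    });
  }
  // Pretend the first two lookups finished; the queue drains in order.
  throttle.on_complete();
  throttle.on_complete();
  return 0;
}

The real change would live in the MDS rejoin/open_ino path; the point
here is only the bounded queue -- the logic for checking whether a
just-loaded dirfrag already satisfies a queued cap is not shown.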
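
Greg's idea of using the reconnect paths to aggregate caps into
directories could look roughly like the following -- again a sketch in
plain C++ rather than Ceph's internal types, with CapReconnect and
parent_dir as hypothetical stand-ins for the path hint carried in
cap_reconnect_t. Pending reconnects are grouped by parent directory,
one directory read is issued per group, and the queued caps are
processed when that read completes:

// Hypothetical sketch, not Ceph code: one directory read can satisfy
// many caps instead of one backtrace lookup per inode.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

using inodeno_t = std::uint64_t;

struct CapReconnect {
  inodeno_t ino;
  std::string path;  // path hint carried in the client's reconnect
};

// Strip the final path component to get the containing directory.
static std::string parent_dir(const std::string& path) {
  auto pos = path.find_last_of('/');
  return pos == std::string::npos ? std::string("/")
                                  : path.substr(0, pos + 1);
}

int main() {
  std::vector<CapReconnect> reconnects = {
      {101, "/backup/a/file1"}, {102, "/backup/a/file2"},
      {103, "/backup/b/file3"}, {104, "/home/user/doc"},
  };

  // Caps waiting on each directory being read into the cache.
  std::unordered_map<std::string, std::vector<CapReconnect>> waiters;
  for (const auto& r : reconnects)
    waiters[parent_dir(r.path)].push_back(r);

  // One read per directory instead of one lookup per inode; in a real
  // MDS the waiters would be processed in the read-completion callback.
  for (const auto& [dir, caps] : waiters) {
    std::cout << "read dirfrag " << dir << " then process "
              << caps.size() << " pending cap(s)\n";
  }
  return 0;
}

For a workload like the backup client above, which holds ~4 million of
the caps, most of these groups would be large, so a single dirfrag read
replaces many individual backtrace lookups; the genuinely scattered
leftovers could still fall back to per-inode open_ino, subject to the
throttle.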