On Mon, Jan 30, 2017 at 7:09 AM, Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
>
> On 01/26/2017 03:34 PM, John Spray wrote:
>>
>> On Thu, Jan 26, 2017 at 8:18 AM, Burkhard Linke
>> <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> We are running two MDS servers in an active/standby-replay setup.
>>> Recently we had to disconnect the active MDS server, and failover to
>>> the standby worked as expected.
>>>
>>> The filesystem currently contains over 5 million files, so reading
>>> all the metadata information from the data pool took too long, since
>>> the information was not available in the OSD page caches. The MDS was
>>> timed out by the mons, and a failover switch to the former active MDS
>>> (which was available as a standby again) happened. This MDS in turn
>>> had to read the metadata, again running into a timeout, failover,
>>> etc. I resolved the situation by disabling one of the MDS daemons,
>>> which kept the mons from failing the sole remaining MDS.
>>
>> The MDS does not re-read every inode on startup -- rather, it replays
>> its journal (the overall number of files in your system does not
>> factor into this).
>>
>>> So given a large filesystem, how do I prevent failover flapping
>>> between MDS instances that are in the rejoin state and reading the
>>> inode information?
>>
>> The monitor's decision to fail an unresponsive MDS is based on the MDS
>> not sending a beacon to the mon -- there is no limit on how long an
>> MDS is allowed to stay in a given state (such as rejoin).
>>
>> So there are two things to investigate here:
>>  * Why is the MDS taking so long to start?
>>  * Why is the MDS failing to send beacons to the monitor while it is
>>    in whatever process is taking it so long?
>
> Under normal operation our system has about 4.5-4.9 million active
> caps. Most of them (~4 million) are associated with the machine running
> the nightly backups.
>
> I assume that during the rejoin phase, the MDS is renewing the clients'
> caps. We see massive amounts of small I/O on the data pool (up to
> 30,000-40,000 IOPS) during the rejoin phase. Does the MDS need to
> access the inode information to renew a cap? This would explain the
> high number of IOPS and why the rejoin phase can take up to 20 minutes.

Ah, I see. You've identified the issue -- the client is informing the
MDS about which inodes it has caps on, and the MDS is responding by
loading those inodes; in order to dereference them it goes via the data
pool to read the backtrace on each of the inode objects.

This is not great behaviour from the MDS: doing O(files with caps) IOs,
especially to the data pool, is not something we want to be doing during
failovers.

Things to try to mitigate this with the current code:
 * Using standby-replay daemons (if you're not already), so that the
   standby has a better chance of already having the inodes in cache,
   avoiding the need to load them.
 * Increasing the MDS journal size ("mds log max segments") so that the
   MDS will tend to keep a longer journal and have a better chance of
   still having the inodes in the journal at the time the failover
   happens.
 * Decreasing "mds cache size" to limit the number of caps that can be
   out there at any one time.

I'll respond separately to ceph-devel about how we might change the code
to improve this case.

John

> Not sure about the second question, since the IOPS should not prevent
> beacons from reaching the monitors.
> We will have to move the MDS servers to different racks during this
> week. I'll try to bump up the debug level before.
>
> Regards,
> Burkhard

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
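[Editor's note: as a starting point for the tunables John mentions above,
here is a minimal ceph.conf sketch. Only "mds log max segments" and
"mds cache size" come from the thread itself; "mds standby replay" is
assumed as the usual pre-Luminous way to enable standby-replay, and the
numeric values are purely illustrative, not recommendations -- tune them
for your own cluster and release.]

    [mds]
        # Let a non-active daemon follow the active MDS's journal so its
        # cache is already warm when a failover happens.
        mds standby replay = true

        # Keep more journal segments than the default (30) so recently
        # used inodes are more likely to still be in the journal at
        # failover time. 200 is an illustrative value only.
        mds log max segments = 200

        # Bounds the number of inodes (and therefore caps) the MDS keeps
        # in memory; pick something lower than your current setting if
        # the goal is to limit outstanding caps (default is 100000).
        mds cache size = 1000000

Depending on the release, the journal and cache settings can usually also
be changed at runtime (e.g. via "ceph tell mds.<id> injectargs"), while
standby-replay only takes effect when the standby daemon is (re)started.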