On Mon, Jan 30, 2017 at 6:30 AM, John Spray <jspray@xxxxxxxxxx> wrote:
> This case (see forwarded) is showing that our current rejoin code is
> handling situations with many capabilities quite badly -- I think we
> should try to improve this soon.
>
> One thought I have is to just throttle the number of open_inos that we
> do, so that we allow the cache to get populated with the already-hit
> dirfrags before trying to load more backtraces. I created a ticket for
> that here: http://tracker.ceph.com/issues/18730 (should be pretty
> simple and doable for Luminous). That would help in cases where many
> of the affected inodes were in the same directory (which I expect is
> all real workloads).

My concern here is that if the caps are that scattered, a straight
throttle like that might slow us down even more, as we read in
directories for a cap and then throw them out. :/

> There are probably other, bigger changes we could make for this case,
> such as using the path passed in cap_reconnect_t to be smarter, or
> even adding a metadata pool structure that would provide super-fast
> lookup of backtraces for the N most recently touched inodes -- not
> saying we necessarily want to go that far!

I don't think we want to be doing durable storage for something like
that any more than we already do. I'm a little surprised this isn't
handled by journaled open inodes -- are we simply dropping some of them
after a long enough period without activity, or are we only journaling
inode numbers?

The possibility of using the paths to try to aggregate caps into
directories makes a lot more sense to me. We ought to be able to set up
lists of caps as waiters on a directory being read and then attempt to
process them as stuff comes in, or else put them into a proper lookup?
-Greg

> John
>
> ---------- Forwarded message ----------
> From: Burkhard Linke <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
> Date: Mon, Jan 30, 2017 at 7:09 AM
> Subject: Re: [ceph-users] MDS flapping: how to increase MDS timeouts?
> To: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
>
> Hi,
>
> On 01/26/2017 03:34 PM, John Spray wrote:
>>
>> On Thu, Jan 26, 2017 at 8:18 AM, Burkhard Linke
>> <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> We are running two MDS servers in an active/standby-replay setup.
>>> Recently we had to disconnect the active MDS server, and failover to
>>> the standby worked as expected.
>>>
>>> The filesystem currently contains over 5 million files, so reading all
>>> the metadata information from the data pool took too long, since the
>>> information was not available in the OSD page caches. The MDS was
>>> timed out by the mons, and a failover switch to the former active MDS
>>> (which was available as a standby again) happened. This MDS in turn
>>> had to read the metadata, again running into a timeout, failover, etc.
>>> I resolved the situation by disabling one of the MDS daemons, which
>>> kept the mons from failing the now solely available MDS.
>>
>> The MDS does not re-read every inode on startup -- rather, it replays
>> its journal (the overall number of files in your system does not
>> factor into this).
>>
>>> So given a large filesystem, how do I prevent failover flapping
>>> between MDS instances that are in the rejoin state and reading the
>>> inode information?
>>
>> The monitor's decision to fail an unresponsive MDS is based on the MDS
>> not sending a beacon to the mon -- there is no limit on how long an
>> MDS is allowed to stay in a given state (such as rejoin).
>>
>> So there are two things to investigate here:
>> * Why is the MDS taking so long to start?
>> * Why is the MDS failing to send beacons to the monitor while it is
>>   in whatever process is taking it so long?
>
> Under normal operation our system has about 4.5-4.9 million active
> caps. Most of them (~4 million) are associated with the machine
> running the nightly backups.
>
> I assume that during the rejoin phase, the MDS is renewing the
> clients' caps. We see a massive amount of small I/O on the data pool
> (up to 30,000-40,000 IOPS) during the rejoin phase. Does the MDS need
> to access the inode information to renew a cap? This would explain the
> high number of IOPS and why the rejoin phase can take up to 20
> minutes.
>
> I'm not sure about the second question, since the IOPS alone should
> not prevent beacons from reaching the monitors. We will have to move
> the MDS servers to different racks this week; I'll try to bump up the
> debug level beforehand.
>
> Regards,
> Burkhard
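
A rough sanity check on the numbers above: ~4.5 million caps with one
backtrace read each at the observed 30,000-40,000 IOPS is only a couple
of minutes of pure reads, so a 20-minute rejoin suggests several RADOS
operations per inode and/or limited parallelism. That is why bounding
the number of in-flight open_ino lookups, as John proposes, should
help. A minimal sketch of such a throttle in plain C++ (not actual Ceph
code; OpenInoThrottle and its callbacks are hypothetical names):

// Hypothetical sketch, not Ceph code: cap the number of backtrace
// lookups in flight so that already-fetched dirfrags get a chance to
// satisfy later reconnects before more random reads are issued.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <iostream>

using inodeno_t = std::uint64_t;

class OpenInoThrottle {
  struct Pending {
    inodeno_t ino;
    std::function<void(inodeno_t)> fetch;
  };

public:
  explicit OpenInoThrottle(std::size_t max_in_flight)
      : max_in_flight_(max_in_flight) {}

  // Queue a backtrace lookup; start it immediately if under the limit.
  void submit(inodeno_t ino, std::function<void(inodeno_t)> fetch) {
    if (in_flight_ < max_in_flight_) {
      ++in_flight_;
      fetch(ino);  // in a real MDS this would issue the RADOS read
    } else {
      waiting_.push_back({ino, std::move(fetch)});
    }
  }

  // Called when a lookup completes; start the next queued one, if any.
  void on_complete() {
    --in_flight_;
    if (!waiting_.empty()) {
      Pending next = std::move(waiting_.front());
      waiting_.pop_front();
      ++in_flight_;
      next.fetch(next.ino);
    }
  }

private:
  std::size_t max_in_flight_;
  std::size_t in_flight_ = 0;
  std::deque<Pending> waiting_;
};

int main() {
  OpenInoThrottle throttle(2);  // allow only two lookups at a time
  for (inodeno_t ino = 1; ino <= 5; ++ino) {
    throttle.submit(ino, [](inodeno_t i) {
      std::cout << "fetching backtrace for inode " << i << "\n";
    });
  }
  // Pretend the first two lookups finished; the queue drains in order.
  throttle.on_complete();
  throttle.on_complete();
  return 0;
}

The real change would live in the MDS rejoin/open_ino path; the point
here is only the bounded queue -- the logic for checking whether a
just-loaded dirfrag already satisfies a queued cap is not shown.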
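
Greg's idea of using the reconnect paths to aggregate caps into
directories could look roughly like the following -- again a sketch in
plain C++ rather than Ceph's internal types, with CapReconnect and
parent_dir as hypothetical stand-ins for the path hint carried in
cap_reconnect_t. Pending reconnects are grouped by parent directory,
one directory read is issued per group, and the queued caps are
processed when that read completes:

// Hypothetical sketch, not Ceph code: one directory read can satisfy
// many caps instead of one backtrace lookup per inode.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

using inodeno_t = std::uint64_t;

struct CapReconnect {
  inodeno_t ino;
  std::string path;  // path hint carried in the client's reconnect
};

// Strip the final path component to get the containing directory.
static std::string parent_dir(const std::string& path) {
  auto pos = path.find_last_of('/');
  return pos == std::string::npos ? std::string("/")
                                  : path.substr(0, pos + 1);
}

int main() {
  std::vector<CapReconnect> reconnects = {
      {101, "/backup/a/file1"}, {102, "/backup/a/file2"},
      {103, "/backup/b/file3"}, {104, "/home/user/doc"},
  };

  // Caps waiting on each directory being read into the cache.
  std::unordered_map<std::string, std::vector<CapReconnect>> waiters;
  for (const auto& r : reconnects)
    waiters[parent_dir(r.path)].push_back(r);

  // One read per directory instead of one lookup per inode; in a real
  // MDS the waiters would be processed in the read-completion callback.
  for (const auto& [dir, caps] : waiters) {
    std::cout << "read dirfrag " << dir << " then process "
              << caps.size() << " pending cap(s)\n";
  }
  return 0;
}

For a workload like the backup client above, which holds ~4 million of
the caps, most of these groups would be large, so a single dirfrag read
replaces many individual backtrace lookups; the genuinely scattered
leftovers could still fall back to per-inode open_ino, subject to the
throttle.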