Hi,

I would also be interested in whether there is a way to determine if this is happening. I'm not sure if it's related, but when I updated a number of OSD nodes from kernel 4.4 to 4.7, I started seeing lots of random alerts from OSDs saying that other OSDs were not responding. I couldn't determine why the OSDs were wrongly being marked out; it was almost as if they just hung for a while, which is remarkably similar to what you describe. The only difference in my case is that I wouldn't have said my cluster was particularly busy or the load particularly high.

I read through the XFS thread you posted and it seemed to suggest the problem lies somewhere between the 3.x and 4.x kernels. As soon as I reverted to kernel 4.4 (from Ubuntu) the problem stopped immediately.

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Dan van der Ster
> Sent: 08 February 2017 11:08
> To: Thorvald Natvig <thorvald@xxxxxxxxxxxx>
> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Workaround for XFS lockup resulting in down OSDs
>
> Hi,
>
> This is interesting. Do you have a bit more info about how to identify a server which is suffering from this problem? Is there some
> process (xfs* or kswapd?) we'll see as busy in top or iotop?
>
> Also, which kernel are you using?
>
> Cheers, Dan
>
>
> On Tue, Feb 7, 2017 at 6:59 PM, Thorvald Natvig <thorvald@xxxxxxxxxxxx> wrote:
> > Hi,
> >
> > We've encountered a small "kernel feature" in XFS under Filestore. We
> > have a workaround, and would like to share it in case others have the
> > same problem.
> >
> > Under high load, on slow storage, with lots of dirty buffers and low
> > memory, there's a design choice with unfortunate side effects when
> > multiple XFS filesystems are mounted, as is often the case with a JBOD
> > full of drives. This results in network traffic stalling, leading to
> > OSDs failing heartbeats.
> >
> > In short, when the kernel needs to allocate memory for anything, it
> > first figures out how many pages it needs, then goes to each
> > filesystem and says "release N pages". In XFS, that's implemented as
> > follows:
> >
> > - For each AG (8 in our case):
> >   - Try to lock the AG
> >   - Release unused buffers, up to N
> > - If this point is reached and we didn't manage to release at least N
> >   pages, try again, but this time wait for the lock.
> >
> > That last part is the problem: if the lock is currently held by, say,
> > another kernel thread that is flushing dirty buffers, the memory
> > allocation stalls, even though we have 30 other XFS filesystems that
> > could release memory, and the kernel also has a lot of non-filesystem
> > memory that could be released.
> >
> > This manifests as OSDs going offline during high load, with other OSDs
> > claiming that the OSD stopped responding to health checks. It is
> > especially prevalent during cache-tier flushing and large backfills,
> > which put very heavy load on the write buffers, increasing the
> > probability of one of these events.
> > In reality, the OSD is stuck in the kernel, trying to allocate buffers
> > to build a TCP packet to answer the network message. As soon as the
> > buffers are flushed (which can take a while), the OSD recovers, but
> > now has to deal with being marked down in the monitor maps.
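For illustration, the two-pass reclaim pattern described above boils down to roughly the following C sketch. This is not the actual XFS source; the function names, flag names and values, and structure here are assumptions made purely for illustration (although a "trylock" bit of 0x2 would be consistent with the "$flags & 2" mask in the workaround quoted below).

  /*
   * Simplified sketch of the reclaim pattern described above. NOT the
   * real XFS code; names, flag values and structure are assumptions.
   */
  #include <stdbool.h>

  #define NR_AGS        8        /* allocation groups per filesystem */
  #define SYNC_WAIT     0x0001   /* assumed: block on the per-AG reclaim lock */
  #define SYNC_TRYLOCK  0x0002   /* assumed: skip an AG whose lock is busy */

  struct ag {
      bool busy;                 /* lock currently held, e.g. by a flusher */
      long reclaimable;          /* pages this AG could give back */
  };

  /* Toy stand-ins for the real locking/reclaim primitives. */
  static bool ag_trylock(struct ag *a) { return !a->busy; }
  static void ag_lock(struct ag *a)    { a->busy = false; /* may wait a long time */ }
  static void ag_unlock(struct ag *a)  { (void)a; }

  static long ag_release_pages(struct ag *a, long nr)
  {
      long n = nr < a->reclaimable ? nr : a->reclaimable;
      a->reclaimable -= n;
      return n;
  }

  static long reclaim_pass(struct ag ags[NR_AGS], long want, int flags)
  {
      long freed = 0;

      for (int i = 0; i < NR_AGS && freed < want; i++) {
          if (flags & SYNC_WAIT)
              ag_lock(&ags[i]);          /* second pass: waits behind a flusher */
          else if (!ag_trylock(&ags[i]))
              continue;                  /* first pass: just skip a busy AG */

          freed += ag_release_pages(&ags[i], want - freed);
          ag_unlock(&ags[i]);
      }
      return freed;
  }

  long reclaim_inodes(struct ag ags[NR_AGS], long want)
  {
      /* First pass: opportunistic, never blocks. */
      long freed = reclaim_pass(ags, want, SYNC_TRYLOCK);

      /*
       * Second pass: target not met, so retry and wait for each AG lock.
       * This is the step that can stall the allocating thread (e.g. an OSD
       * building a TCP reply) until the flush completes. Masking the flags
       * so the wait bit is never set keeps the code on the non-blocking
       * path, which is what the systemtap below does.
       */
      if (freed < want)
          freed += reclaim_pass(ags, want - freed, SYNC_WAIT);

      return freed;
  }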
> >
> > The following systemtap changes the kernel behavior to not do the lock-waiting:
> >
> > probe module("xfs").function("xfs_reclaim_inodes_ag").call { $flags = $flags & 2 }
> >
> > Save it to a file, and run with 'stap -v -g -d kernel
> > --suppress-time-limits <path-to-file>'. We've been running this for a
> > few weeks, and the issue is completely gone.
> >
> > There was a writeup on the XFS mailing list a while ago about the same
> > issue ( http://www.spinics.net/lists/linux-xfs/msg01541.html ), but
> > unfortunately it didn't result in consensus on a patch. This problem
> > won't exist in BlueStore, so we consider the systemtap approach a
> > workaround until we're ready to deploy BlueStore.
> >
> > - Thorvald
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com