Hi,

This is interesting. Do you have a bit more info about how to identify
a server which is suffering from this problem? Is there some process
(xfs* or kswapd?) we'll see as busy in top or iotop?

Also, which kernel are you using?

Cheers, Dan

On Tue, Feb 7, 2017 at 6:59 PM, Thorvald Natvig <thorvald@xxxxxxxxxxxx> wrote:
> Hi,
>
> We've encountered a small "kernel feature" in XFS using Filestore. We
> have a workaround, and would like to share in case others have the
> same problem.
>
> Under high load, on slow storage, with lots of dirty buffers and low
> memory, there's a design choice with unfortunate side-effects if you
> have multiple XFS filesystems mounted, such as often is the case when
> you have a JBOD full of drives. This results in network traffic
> stalling, leading to OSDs failing heartbeats.
>
> In short, when the kernel needs to allocate memory for anything, it
> first figures out how many pages it needs, then goes to each
> filesystem and says "release N pages". In XFS, that's implemented as
> follows:
>
> - For each AG (8 in our case):
>   - Try to lock AG
>   - Release unused buffers, up to N
> - If this point is reached, and we didn't manage to release at least N
>   pages, try again, but this time wait for the lock.
>
> That last part is the problem; if the lock is currently held by, say,
> another kernel thread that is currently flushing dirty buffers, then
> the memory allocation stalls. However, we have 30 other XFS
> filesystems that could release memory, and the kernel also has a lot
> of non-filesystem memory that can be released.
>
> This manifests as OSDs going offline during high load, with other OSDs
> claiming that the OSD stopped responding to health checks. This is
> especially prevalent during cache tier flushing and large backfills,
> which can put very heavy load on the write buffers, thus increasing
> the probability of one of these events.
> In reality, the OSD is stuck in the kernel, trying to allocate buffers
> to build a TCP packet to answer the network message. As soon as the
> buffers are flushed (which can take a while), the OSD recovers, but
> now has to deal with being marked down in the monitor maps.
>
> The following systemtap script changes the kernel behavior to not do
> the lock-waiting:
>
> probe module("xfs").function("xfs_reclaim_inodes_ag").call {
>   $flags = $flags & 2
> }
>
> Save it to a file, and run with 'stap -v -g -d kernel
> --suppress-time-limits <path-to-file>'. We've been running this for a
> few weeks, and the issue is completely gone.
>
> There was a writeup on the XFS mailing list a while ago about the same
> issue ( http://www.spinics.net/lists/linux-xfs/msg01541.html ), but
> unfortunately it didn't result in consensus on a patch. This problem
> won't exist in BlueStore, so we consider the systemtap approach a
> workaround until we're ready to deploy BlueStore.
>
> - Thorvald
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
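
For anyone trying to picture the two-pass behaviour described in the quoted
message, here is a rough toy model in Python. It is only a sketch of the
description above, not the XFS code: the names AG, reclaim and allow_wait are
made up for illustration, and instead of really sleeping on a held lock the
second pass just records which AG it would have waited on. Setting
allow_wait=False is meant to resemble what the message says the workaround
does (skipping the lock-waiting pass).

```python
import threading


class AG:
    """Toy allocation group: a lock plus a count of reclaimable pages."""

    def __init__(self, reclaimable):
        self.lock = threading.Lock()
        self.reclaimable = reclaimable


def reclaim(ags, wanted, allow_wait=True):
    """Return (pages_freed, indexes_of_AGs_we_would_have_blocked_on).

    Pass 1 only try-locks each AG and skips the busy ones.
    Pass 2 runs only if pass 1 freed fewer than `wanted` pages and
    allow_wait is True; the real code would sleep on a busy lock there,
    this model merely records the AG index.
    """
    freed = 0
    waited_on = []
    passes = (False, True) if allow_wait else (False,)
    for blocking in passes:
        for index, ag in enumerate(ags):
            if freed >= wanted:
                return freed, waited_on
            if not ag.lock.acquire(blocking=False):
                if blocking:
                    waited_on.append(index)  # the stall would happen here
                continue  # busy AG: skip it in this model
            try:
                take = min(ag.reclaimable, wanted - freed)
                ag.reclaimable -= take
                freed += take
            finally:
                ag.lock.release()
    return freed, waited_on


# Pretend a flusher thread is holding the lock of AG 0.
ags = [AG(4), AG(4)]
ags[0].lock.acquire()
print(reclaim(ags, wanted=6))  # (4, [0]): second pass would block on AG 0
```

With allow_wait=False the same call on fresh AGs returns (4, []), i.e. the
caller gets fewer pages than requested but never waits, leaving the shortfall
to be made up elsewhere (other filesystems or non-filesystem memory, as the
message points out).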