Workaround for XFS lockup resulting in down OSDs

Thorvald Natvig <thorvald@xxxxxxxxxxxx> · Tue, 7 Feb 2017 09:59:48 -0800

Hi,

We've encountered a small "kernel feature" in XFS when using Filestore.
We have a workaround, and would like to share it in case others have
the same problem.

Under high load, on slow storage, with lots of dirty buffers and low
memory, there's a design choice with unfortunate side-effects if you
have multiple XFS filesystems mounted, as is often the case when
you have a JBOD full of drives. This results in network traffic
stalling, leading to OSDs failing heartbeats.

In short, when the kernel needs to allocate memory for anything, it
first figures out how many pages it needs, then goes to each
filesystem and says "release N pages". In XFS, that's implemented as
follows:

- For each AG (8 in our case):
  - Try to lock AG
  - Release unused buffers, up to N
- If this point is reached, and we didn't manage to release at least N
  pages, try again, but this time wait for the lock.

That last part is the problem; if the lock is currently held by, say,
another kernel thread that is currently flushing dirty buffers, then
the memory allocation stalls. However, we have 30 other XFS
filesystems that could release memory, and the kernel also has a lot
of non-filesystem memory that can be released.
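
To make the two-pass behavior concrete, here is a minimal toy model in
Python. It is only a sketch of the steps described above, not the kernel
code: the AllocationGroup class, the reclaim() function, and the
may_wait parameter are all hypothetical names made up for illustration,
and the second pass reports which locks it would sleep on instead of
actually blocking.

```python
import threading


class AllocationGroup:
    """Toy stand-in for an XFS allocation group (hypothetical model)."""

    def __init__(self, reclaimable):
        self.lock = threading.Lock()
        self.reclaimable = reclaimable  # buffers that could be released


def reclaim(groups, wanted, may_wait):
    """Try to release `wanted` buffers from `groups`.

    Returns (freed, blocked_on). `blocked_on` lists the indices of the
    groups whose lock the second pass would sleep on; this toy model
    reports them instead of actually blocking.
    """
    freed = 0
    skipped = []
    # Pass 1: only take locks that are free right now.
    for index, ag in enumerate(groups):
        if freed >= wanted:
            break
        if not ag.lock.acquire(blocking=False):
            skipped.append(index)
            continue
        try:
            take = min(ag.reclaimable, wanted - freed)
            ag.reclaimable -= take
            freed += take
        finally:
            ag.lock.release()
    # Pass 2: if we came up short, the waiting variant retries the
    # skipped groups with a blocking lock. This is where the stall is.
    if freed < wanted and skipped and may_wait:
        return freed, skipped
    return freed, []


if __name__ == "__main__":
    groups = [AllocationGroup(reclaimable=4) for _ in range(8)]
    groups[0].lock.acquire()  # pretend a flusher thread holds AG 0
    print(reclaim(groups, wanted=100, may_wait=True))
```

With AG 0 held and more buffers requested than the other seven groups
can supply, the waiting variant frees 28 buffers and then reports that
it would block on AG 0. With may_wait=False it returns the same 28
buffers and moves on, which is the behavior we want from the kernel.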

This manifests as OSDs going offline during high load, with other OSDs
claiming that the OSD stopped responding to health checks. This is
especially prevalent during cache tier flushing and large backfills,
which can put very heavy load on the write buffers, thus increasing
the probability of one of these events.

In reality, the OSD is stuck in the kernel, trying to allocate buffers
to build a TCP packet to answer the network message. As soon as the
buffers are flushed (which can take a while), the OSD recovers, but
now has to deal with being marked down in the monitor maps.

The following SystemTap script changes the kernel behavior so that it
never waits for the lock:

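probe module("xfs").function("xfs_reclaim_inodes_ag").call {
  # Keep only bit 0x2 of the flags argument, which clears the
  # bit that requests waiting for the lock.
  $flags = $flags & 2
}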

Save it to a file, and run with 'stap -v -g -d kernel
--suppress-time-limits <path-to-file>'. The -g (guru mode) flag is
needed because the script writes to a kernel variable. We've been
running this for a few weeks, and the issue is completely gone.
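
For anyone wondering what the "& 2" in the probe does, the arithmetic
can be sketched in Python. The flag names and values below are an
assumption taken from memory of fs/xfs/xfs_icache.h in 4.x kernels
(SYNC_WAIT = 0x0001, SYNC_TRYLOCK = 0x0002), and the idea that the
caller passes both flags is likewise an assumption. Please check the
source of the kernel you actually run before relying on it.

```python
# Assumed flag values; verify against fs/xfs/xfs_icache.h for your kernel.
SYNC_WAIT = 0x0001
SYNC_TRYLOCK = 0x0002

FLAG_NAMES = {SYNC_WAIT: "SYNC_WAIT", SYNC_TRYLOCK: "SYNC_TRYLOCK"}


def apply_probe_mask(flags):
    """Same arithmetic as the probe body: $flags = $flags & 2."""
    return flags & 2


def describe(flags):
    """List the names of the assumed flag bits that are set in `flags`."""
    return [name for bit, name in sorted(FLAG_NAMES.items()) if flags & bit]


if __name__ == "__main__":
    before = SYNC_TRYLOCK | SYNC_WAIT
    after = apply_probe_mask(before)
    print(describe(before), "->", describe(after))
```

Under those assumed values, a call that asks for both trylock and wait
is reduced to trylock only, so the "try again, but wait for the lock"
pass never runs.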

There was a writeup on the XFS mailing list a while ago about the same
issue ( http://www.spinics.net/lists/linux-xfs/msg01541.html ), but
unfortunately it didn't result in consensus on a patch. This problem
won't exist in BlueStore, so we consider the systemtap approach a
workaround until we're ready to deploy BlueStore.

- Thorvald
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com