Re: Workaround for XFS lockup resulting in down OSDs

Dan van der Ster <dan@xxxxxxxxxxxxxx> · Wed, 8 Feb 2017 12:07:56 +0100

Hi,

This is interesting. Do you have a bit more info about how to identify
a server which is suffering from this problem? Is there some process
(xfs* or kswapd?) we'll see as busy in top or iotop?

Also, which kernel are you using?

Cheers, Dan

On Tue, Feb 7, 2017 at 6:59 PM, Thorvald Natvig <thorvald@xxxxxxxxxxxx> wrote:
> Hi,
>
> We've encountered a small "kernel feature" in XFS using Filestore. We
> have a workaround, and would like to share in case others have the
> same problem.
>
> Under high load, on slow storage, with lots of dirty buffers and low
> memory, there's a design choice with unfortunate side-effects if you
> have multiple XFS filesystems mounted, as is often the case when
> you have a JBOD full of drives. This results in network traffic
> stalling, leading to OSDs failing heartbeats.
>
> In short, when the kernel needs to allocate memory for anything, it
> first figures out how many pages it needs, then goes to each
> filesystem and says "release N pages". In XFS, that's implemented as
> follows:
>
> - For each AG (8 in our case):
>   - Try to lock AG
>   - Release unused buffers, up to N
> - If this point is reached, and we didn't manage to release at least N
> pages, try again, but this time wait for the lock.
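
For anyone trying to picture the two passes described above, here is a
toy model in Python. It is only a sketch of the description, not the
kernel code: the AG class, the reclaim() function and the may_wait
parameter are made-up names, and instead of really blocking in the
second pass it just reports that it would block.

```python
import threading


class AG:
    """Toy stand-in for one allocation group (hypothetical, not kernel code)."""

    def __init__(self, reclaimable):
        self.lock = threading.Lock()
        self.reclaimable = reclaimable  # items this AG could release


def reclaim(ags, n, may_wait=True):
    """Try to release n items; return (released, would_block)."""
    released = 0
    skipped = 0
    # Pass 1: only use AGs whose lock is free right now.
    for ag in ags:
        if released >= n:
            break
        if not ag.lock.acquire(blocking=False):
            skipped += 1  # lock is busy, move on to the next AG
            continue
        try:
            take = min(ag.reclaimable, n - released)
            ag.reclaimable -= take
            released += take
        finally:
            ag.lock.release()
    # Pass 2 would retry and wait for the busy locks. The model does not
    # actually block; it only reports that the caller would stall here.
    would_block = may_wait and skipped > 0 and released < n
    return released, would_block


if __name__ == "__main__":
    ags = [AG(5), AG(5)]
    ags[0].lock.acquire()   # pretend another thread is flushing this AG
    print(reclaim(ags, 8))  # prints (5, True): short of 8, so it would wait
```

With may_wait=False the function gives up after the first pass, which
is the behavior the workaround further down aims for.
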
>
> That last part is the problem; if the lock is currently held by, say,
> another kernel thread that is flushing dirty buffers, then the memory
> allocation stalls until the lock is released. The stall is avoidable:
> we have 30 other XFS filesystems that could release memory, and the
> kernel also has a lot of non-filesystem memory that can be released.
>
> This manifests as OSDs going offline during high load, with other OSDs
> claiming that the OSD stopped responding to health checks. This is
> especially prevalent during cache tier flushing and large backfills,
> which can put very heavy load on the write buffers, thus increasing
> the probability of one of these events.
> In reality, the OSD is stuck in the kernel, trying to allocate buffers
> to build a TCP packet to answer the network message. As soon as the
> buffers are flushed (which can take a while), the OSD recovers, but
> now has to deal with being marked down in the monitor maps.
>
> The following SystemTap script changes the kernel behavior so that it
> no longer waits for the lock:
>
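> # Guru-mode probe: rewrite the flags argument on entry to the function.
> probe module("xfs").function("xfs_reclaim_inodes_ag").call {
>  # Keep only bit 0x2 of flags, clearing the bit that requests waiting.
>  $flags = $flags & 2
> }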
>
> Save it to a file, and run with 'stap -v -g -d kernel
> --suppress-time-limits <path-to-file>'. We've been running this for a
> few weeks, and the issue is completely gone.
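
A note on the "& 2" in the script: assuming the XFS flag values are
SYNC_WAIT = 0x1 and SYNC_TRYLOCK = 0x2 (that is how I read the headers
for kernels of this era; please check your own source tree), the mask
keeps the try-lock bit and drops the wait bit. A quick Python
illustration of the arithmetic, where patched_flags() is a made-up
helper name:

```python
# Assumed values; verify against the XFS headers of your kernel.
SYNC_WAIT = 0x1     # caller is willing to wait
SYNC_TRYLOCK = 0x2  # caller wants try-lock behavior


def patched_flags(flags):
    """Mirror the probe body: keep only bit 0x2 of flags."""
    return flags & 2


if __name__ == "__main__":
    both = SYNC_WAIT | SYNC_TRYLOCK
    print(patched_flags(both) == SYNC_TRYLOCK)  # wait bit is cleared
    print(patched_flags(SYNC_WAIT))             # wait-only flags become 0
```

A caller that passes only the wait bit ends up with flags of 0, so it
may be worth checking how that case behaves on your kernel.
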
>
> There was a writeup on the XFS mailing list a while ago about the same
> issue ( http://www.spinics.net/lists/linux-xfs/msg01541.html ), but
> unfortunately it didn't result in consensus on a patch. This problem
> won't exist in BlueStore, so we consider the systemtap approach a
> workaround until we're ready to deploy BlueStore.
>
> - Thorvald
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com