Re: Workaround for XFS lockup resulting in down OSDs

Hi,

We've encountered this on both kernel 4.4 and 4.8. It might have been
present earlier, but we no longer have data for that.

While we can't prove causation, there's a high correlation with kernel
page allocation failures. If you see "[timestamp] <function>: page
allocation failure: order:5,
mode:0x2082120(GFP_ATOMIC|__GFP_COLD|__GFP_MEMALLOC)" or similar in
dmesg, that indicates you're in a low-memory situation with high
memory pressure. Our first workaround was to keep an eye on
/proc/buddyinfo and to run

sync
echo 1 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory

whenever the order-5 bucket dropped below 10.
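
In case it's useful, here's a rough sketch of such a watcher. This
exact script is illustrative rather than what we run in production;
the 30-second interval, the threshold and the focus on the Normal
zone are assumptions you'd want to tune per host:

#!/bin/bash
# Watch the order-5 bucket in /proc/buddyinfo for the Normal zone(s);
# if it drops below THRESHOLD, flush dirty data and ask the kernel to
# drop clean page cache and compact memory.
THRESHOLD=10

while sleep 30; do
    # In /proc/buddyinfo the order-0 count is column 5, so order 5 is column 10.
    free5=$(awk '/Normal/ { sum += $10 } END { print sum + 0 }' /proc/buddyinfo)
    if [ "$free5" -lt "$THRESHOLD" ]; then
        logger "buddyinfo: only $free5 order-5 blocks free, dropping caches and compacting"
        sync
        echo 1 > /proc/sys/vm/drop_caches
        echo 1 > /proc/sys/vm/compact_memory
    fi
done

It needs to run as root; a cron job works just as well as a loop, the
important part is reacting before the order-5 bucket reaches zero.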

A pretty dead giveaway is that e.g. SSH will start freezing up and
your shell becomes laggy, with pauses of up to a second. This happens
because running a command allocates memory, which hits the same
stall: the kernel ends up waiting for another kernel thread that is
flushing buffers, despite having (in our case) about 100 GB of other
memory that could be released in an instant.

If you see timeouts happening while /proc/buddyinfo indicates you have
plenty of high-order memory available, then it's not this issue.
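
A quick way to eyeball that (again, just a quick illustration) is to
print the order-5-and-up buckets per zone and check that they're
comfortably non-zero:

# Columns 10 and up in /proc/buddyinfo are the free-block counts for orders 5..10.
awk '{ printf "%s %s %s:", $1, $2, $4; for (i = 10; i <= NF; i++) printf " %s", $i; print "" }' /proc/buddyinfo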

- Thorvald


On Wed, Feb 8, 2017 at 3:48 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> Hi,
>
> I would also be interested in whether there is a way to determine if this is happening. I'm not sure if it's related, but when I updated a
> number of OSD nodes from kernel 4.4 to 4.7, I started seeing lots of random alerts from OSDs saying that other OSDs were not
> responding. The load wasn't particularly high on the cluster, but I couldn't determine why the OSDs were wrongly getting marked out;
> it was almost like they just hung for a while, which is remarkably similar to what you describe. The only difference in my case was
> that I wouldn't have said my cluster was particularly busy.
>
> I read through that XFS thread you posted and it seemed to suggest the problem was between the 3.x and 4.x versions.
>
> As soon as I reverted to Kernel 4.4 (from Ubuntu) the problem stopped immediately.
>
> Nick
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Dan van der Ster
>> Sent: 08 February 2017 11:08
>> To: Thorvald Natvig <thorvald@xxxxxxxxxxxx>
>> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> Subject: Re:  Workaround for XFS lockup resulting in down OSDs
>>
>> Hi,
>>
>> This is interesting. Do you have a bit more info about how to identify a server which is suffering from this problem? Is there some
>> process (xfs* or kswapd?) we'll see as busy in top or iotop?
>>
>> Also, which kernel are you using?
>>
>> Cheers, Dan
>>
>>
>> On Tue, Feb 7, 2017 at 6:59 PM, Thorvald Natvig <thorvald@xxxxxxxxxxxx> wrote:
>> > Hi,
>> >
>> > We've encountered a small "kernel feature" in XFS using Filestore. We
>> > have a workaround, and would like to share in case others have the
>> > same problem.
>> >
>> > Under high load, on slow storage, with lots of dirty buffers and low
>> > memory, there's a design choice with unfortunate side-effects if you
>> > have multiple XFS filesystems mounted, such as often is the case when
>> > you have a JBOD full of drives. This results in network traffic
>> > stalling, leading to OSDs failing heartbeats.
>> >
>> > In short, when the kernel needs to allocate memory for anything, it
>> > first figures out how many pages it needs, then goes to each
>> > filesystem and says "release N pages". In XFS, that's implemented as
>> > follows:
>> >
>> > - For each AG (8 in our case):
>> >   - Try to lock AG
>> >   - Release unused buffers, up to N
>> > - If this point is reached, and we didn't manage to release at least N
>> >   pages, try again, but this time wait for the lock.
>> >
>> > That last part is the problem; if the lock is currently held by, say,
>> > another kernel thread that is currently flushing dirty buffers, then
>> > the memory allocation stalls. However, we have 30 other XFS
>> > filesystems that could release memory, and the kernel also has a lot
>> > of non-filesystem memory that can be released.
>> >
>> > This manifests as OSDs going offline during high load, with other OSDs
>> > claiming that the OSD stopped responding to health checks. This is
>> > especially prevalent during cache tier flushing and large backfills,
>> > which can put very heavy load on the write buffers, thus increasing
>> > the probability of one of these events.
>> > In reality, the OSD is stuck in the kernel, trying to allocate buffers
>> > to build a TCP packet to answer the network message. As soon as the
>> > buffers are flushed (which can take a while), the OSD recovers, but
>> > now has to deal with being marked down in the monitor maps.
>> >
>> > The following systemtap changes the kernel behavior to not do the lock-waiting:
>> >
>> > probe module("xfs").function("xfs_reclaim_inodes_ag").call {
>> >   $flags = $flags & 2
>> > }
>> >
>> > Save it to a file, and run with 'stap -v -g -d kernel
>> > --suppress-time-limits <path-to-file>'. We've been running this for a
>> > few weeks, and the issue is completely gone.
>> >
>> > There was a writeup on the XFS mailing list a while ago about the same
>> > issue ( http://www.spinics.net/lists/linux-xfs/msg01541.html ), but
>> > unfortunately it didn't result in consensus on a patch. This problem
>> > won't exist in BlueStore, so we consider the systemtap approach a
>> > workaround until we're ready to deploy BlueStore.
>> >
>> > - Thorvald
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


