Hi,

I would also be interested in whether there is a way to determine if this is happening. I'm not sure if it's related, but when I updated a number of OSD nodes from kernel 4.4 to 4.7, I started seeing lots of random alerts from OSDs saying that other OSDs were not responding. I couldn't determine why the OSDs were wrongly being marked out; it was almost as if they just hung for a while, which is remarkably similar to what you describe. The only difference in my case is that I wouldn't have said my cluster was particularly busy or the load particularly high.

I read through the XFS thread you posted and it seemed to suggest the problem lies somewhere between the 3.x and 4.x kernels. As soon as I reverted to kernel 4.4 (from Ubuntu) the problem stopped immediately.

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Dan van der Ster
> Sent: 08 February 2017 11:08
> To: Thorvald Natvig <thorvald@xxxxxxxxxxxx>
> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Workaround for XFS lockup resulting in down OSDs
>
> Hi,
>
> This is interesting. Do you have a bit more info about how to identify a server which is suffering from this problem? Is there some
> process (xfs* or kswapd?) we'll see as busy in top or iotop?
>
> Also, which kernel are you using?
>
> Cheers, Dan
>
>
> On Tue, Feb 7, 2017 at 6:59 PM, Thorvald Natvig <thorvald@xxxxxxxxxxxx> wrote:
> > Hi,
> >
> > We've encountered a small "kernel feature" in XFS under Filestore. We
> > have a workaround, and would like to share it in case others have the
> > same problem.
> >
> > Under high load, on slow storage, with lots of dirty buffers and low
> > memory, there's a design choice with unfortunate side effects when
> > multiple XFS filesystems are mounted, as is often the case with a JBOD
> > full of drives. This results in network traffic stalling, leading to
> > OSDs failing heartbeats.
> >
> > In short, when the kernel needs to allocate memory for anything, it
> > first figures out how many pages it needs, then goes to each
> > filesystem and says "release N pages". In XFS, that's implemented as
> > follows:
> >
> > - For each AG (8 in our case):
> >   - Try to lock the AG
> >   - Release unused buffers, up to N
> > - If this point is reached and we didn't manage to release at least N
> >   pages, try again, but this time wait for the lock.
> >
> > That last part is the problem: if the lock is currently held by, say,
> > another kernel thread that is flushing dirty buffers, the memory
> > allocation stalls, even though we have 30 other XFS filesystems that
> > could release memory, and the kernel also has a lot of non-filesystem
> > memory that could be released.
> >
> > This manifests as OSDs going offline during high load, with other OSDs
> > claiming that the OSD stopped responding to health checks. It is
> > especially prevalent during cache-tier flushing and large backfills,
> > which put very heavy load on the write buffers, increasing the
> > probability of one of these events.
> > In reality, the OSD is stuck in the kernel, trying to allocate buffers
> > to build a TCP packet to answer the network message. As soon as the
> > buffers are flushed (which can take a while), the OSD recovers, but
> > now has to deal with being marked down in the monitor maps.
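For illustration, the two-pass reclaim pattern described above boils down to roughly the following C sketch. This is not the actual XFS source; the function names, flag names and values, and structure here are assumptions made purely for illustration (although a "trylock" bit of 0x2 would be consistent with the "$flags & 2" mask in the workaround quoted below).

  /*
   * Simplified sketch of the reclaim pattern described above. NOT the
   * real XFS code; names, flag values and structure are assumptions.
   */
  #include <stdbool.h>

  #define NR_AGS        8        /* allocation groups per filesystem */
  #define SYNC_WAIT     0x0001   /* assumed: block on the per-AG reclaim lock */
  #define SYNC_TRYLOCK  0x0002   /* assumed: skip an AG whose lock is busy */

  struct ag {
      bool busy;                 /* lock currently held, e.g. by a flusher */
      long reclaimable;          /* pages this AG could give back */
  };

  /* Toy stand-ins for the real locking/reclaim primitives. */
  static bool ag_trylock(struct ag *a) { return !a->busy; }
  static void ag_lock(struct ag *a)    { a->busy = false; /* may wait a long time */ }
  static void ag_unlock(struct ag *a)  { (void)a; }

  static long ag_release_pages(struct ag *a, long nr)
  {
      long n = nr < a->reclaimable ? nr : a->reclaimable;
      a->reclaimable -= n;
      return n;
  }

  static long reclaim_pass(struct ag ags[NR_AGS], long want, int flags)
  {
      long freed = 0;

      for (int i = 0; i < NR_AGS && freed < want; i++) {
          if (flags & SYNC_WAIT)
              ag_lock(&ags[i]);          /* second pass: waits behind a flusher */
          else if (!ag_trylock(&ags[i]))
              continue;                  /* first pass: just skip a busy AG */

          freed += ag_release_pages(&ags[i], want - freed);
          ag_unlock(&ags[i]);
      }
      return freed;
  }

  long reclaim_inodes(struct ag ags[NR_AGS], long want)
  {
      /* First pass: opportunistic, never blocks. */
      long freed = reclaim_pass(ags, want, SYNC_TRYLOCK);

      /*
       * Second pass: target not met, so retry and wait for each AG lock.
       * This is the step that can stall the allocating thread (e.g. an OSD
       * building a TCP reply) until the flush completes. Masking the flags
       * so the wait bit is never set keeps the code on the non-blocking
       * path, which is what the systemtap below does.
       */
      if (freed < want)
          freed += reclaim_pass(ags, want - freed, SYNC_WAIT);

      return freed;
  }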
> >
> > The following systemtap changes the kernel behavior to not do the lock-waiting:
> >
> > probe module("xfs").function("xfs_reclaim_inodes_ag").call { $flags = $flags & 2 }
> >
> > Save it to a file, and run with 'stap -v -g -d kernel
> > --suppress-time-limits <path-to-file>'. We've been running this for a
> > few weeks, and the issue is completely gone.
> >
> > There was a writeup on the XFS mailing list a while ago about the same
> > issue ( http://www.spinics.net/lists/linux-xfs/msg01541.html ), but
> > unfortunately it didn't result in consensus on a patch. This problem
> > won't exist in BlueStore, so we consider the systemtap approach a
> > workaround until we're ready to deploy BlueStore.
> >
> > - Thorvald
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com