Re: Failing XFS filesystem underlying Ceph OSDs

Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> · Thu, 13 Aug 2015 10:25:56 -0400

Good morning,
We have experienced one more failure like the ones originally described.  I assume that setting vm.min_free_kbytes to 256 MB helped: there was only one hit, and although that OSD went down, the rest of the cluster stayed up, unlike the previous massive storms.  I have now increased vm.min_free_kbytes to 1 GB.
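
For reference, vm.min_free_kbytes is expressed in kB, so 256 MB corresponds to the 262144 value quoted below and 1 GB to 1048576. A minimal Python sketch of that arithmetic follows; the helper names (mib_to_kb, reserve_fraction) and the 64 GiB host are illustrative assumptions only, not details of this cluster.

```python
"""Convert a desired memory reserve into a vm.min_free_kbytes value (sketch)."""


def mib_to_kb(mib: int) -> int:
    """Return the number of kB (1024-byte units) in `mib` MiB."""
    return mib * 1024


def reserve_fraction(min_free_kb: int, mem_total_kb: int) -> float:
    """Fraction of total RAM that the reserve would occupy."""
    if mem_total_kb <= 0:
        raise ValueError("mem_total_kb must be positive")
    return min_free_kb / mem_total_kb


if __name__ == "__main__":
    # Hypothetical host with 64 GiB of RAM; substitute MemTotal from /proc/meminfo.
    mem_total_kb = 64 * 1024 * 1024
    for mib in (256, 1024):
        kb = mib_to_kb(mib)
        share = reserve_fraction(kb, mem_total_kb)
        print(f"{mib} MiB -> vm.min_free_kbytes={kb} ({share:.2%} of RAM)")
```

Seeing the reserve as a share of total RAM makes it easier to judge whether a larger value is reasonable for a given OSD node.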

I do not know of any way to reproduce the problem, or what causes it.  There is no unusual IO pattern at the time that we are aware of.

Thanks,
Alex

On Wed, Jul 22, 2015 at 8:23 AM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
Hi Dave,
On Mon, Jul 6, 2015 at 8:35 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
On Mon, Jul 06, 2015 at 03:20:19PM -0400, Alex Gorbachev wrote:

> On Sun, Jul 5, 2015 at 7:24 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Sun, Jul 05, 2015 at 12:25:47AM -0400, Alex Gorbachev wrote:
> > > > > sysctl vm.swappiness=20 (can probably be 1 as per article)
> > > > >
> > > > > sysctl vm.min_free_kbytes=262144
> > > >
> > [...]
> > >
> > > We have experienced the problem in various guises with kernels 3.14, 3.19,
> > > 4.1-rc2 and now 4.1, so it's not new to us, just different error stack.
> > > Below are some other stack dumps of what manifested as the same error.
> > >
> > >  [<ffffffff817cf4b9>] schedule+0x29/0x70
> > >  [<ffffffffc07caee7>] _xfs_log_force+0x187/0x280 [xfs]
> > >  [<ffffffff810a4150>] ? try_to_wake_up+0x2a0/0x2a0
> > >  [<ffffffffc07cb019>] xfs_log_force+0x39/0xc0 [xfs]
> > >  [<ffffffffc07d6542>] xfsaild_push+0x552/0x5a0 [xfs]
> > >  [<ffffffff817d2264>] ? schedule_timeout+0x124/0x210
> > >  [<ffffffffc07d662f>] xfsaild+0x9f/0x140 [xfs]
> > >  [<ffffffffc07d6590>] ? xfsaild_push+0x5a0/0x5a0 [xfs]
> > >  [<ffffffff81095e29>] kthread+0xc9/0xe0
> > >  [<ffffffff81095d60>] ? flush_kthread_worker+0x90/0x90
> > >  [<ffffffff817d3718>] ret_from_fork+0x58/0x90
> > >  [<ffffffff81095d60>] ? flush_kthread_worker+0x90/0x90
> > >  INFO: task xfsaild/sdg1:2606 blocked for more than 120 seconds.
> > >        Not tainted 3.19.4-031904-generic #201504131440
> > >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >
> > That's indicative of IO completion problems, but not a crash.
> >
> > >  BUG: unable to handle kernel NULL pointer dereference at (null)
> > >  IP: [<ffffffffc04be80f>] xfs_count_page_state+0x3f/0x70 [xfs]
> > ....
> > >   [<ffffffffc04be880>] xfs_vm_releasepage+0x40/0x120 [xfs]
> > >   [<ffffffff8118a7d2>] try_to_release_page+0x32/0x50
> > >   [<ffffffff8119fe6d>] shrink_page_list+0x69d/0x720
> > >   [<ffffffff811a058d>] shrink_inactive_list+0x1dd/0x5d0
> > ....
> >
> > Again, this is indicative of a page cache issue: a page without
> > buffers has been passed to xfs_vm_releasepage(), which implies the
> > page flags are not correct. i.e PAGE_FLAGS_PRIVATE is set but
> > page->private is null...
> >
> > Again, this is unlikely to be an XFS issue.
> >
>
> Sorry for my ignorance, but would this likely come from Ceph code or a
> hardware issue of some kind, such as a disk drive?  I have reached out to
> RedHat and Ceph community on that as well.

More likely a kernel bug somewhere in the page cache or memory
reclaim paths. The issue is that we only notice the problem long
after it has occurred, i.e. when XFS goes to tear down the page it has
been handed, the page is already in a bad state and so it doesn't
really tell us anything about the cause of the problem.

Realistically, we need a script that reproduces the problem (that
doesn't require a Ceph cluster) to be able to isolate the cause.

In the meantime, you can always try running a kernel built with
CONFIG_XFS_WARN=y to see if that catches problems earlier, and you
might also want to do things like turn on memory poisoning and other
kernel debugging options to try to isolate the cause of the issue.
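
To confirm whether a given kernel build has such options enabled, one could scan its config file. The sketch below is only an illustration: CONFIG_XFS_WARN is the option named above, while CONFIG_PAGE_POISONING and CONFIG_DEBUG_PAGEALLOC are assumed examples of what memory poisoning and related debugging might involve, and the /boot/config-<release> path is a common distribution convention rather than a guarantee.

```python
"""Report which kernel debug options are enabled in a config file (sketch)."""
import os
import platform

# CONFIG_XFS_WARN is named in the thread; the other two are assumed examples.
WANTED = ("CONFIG_XFS_WARN", "CONFIG_PAGE_POISONING", "CONFIG_DEBUG_PAGEALLOC")


def parse_kernel_config(text: str) -> dict:
    """Map option name -> value for every 'CONFIG_FOO=value' line."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Lines such as '# CONFIG_FOO is not set' start with '#' and are skipped.
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def missing_options(text: str, wanted=WANTED) -> list:
    """Return the wanted options that are not set to 'y'."""
    options = parse_kernel_config(text)
    return [name for name in wanted if options.get(name) != "y"]


if __name__ == "__main__":
    # Assumed location; many distributions install the build config here.
    path = f"/boot/config-{platform.release()}"
    if os.path.exists(path):
        with open(path) as handle:
            print("not enabled:", missing_options(handle.read()))
    else:
        print(f"{path} not found; pass the config text to missing_options()")
```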

We have been error free for almost 3 weeks now with these changes:

vm.swappiness=1
vm.min_free_kbytes=262144

I wonder if this is related to our use of high-speed Areca HBAs with RAM writeback cache, combined with having had vm.swappiness=0 previously.  Possibly the HBA hands down a large chunk of IO very fast and the page cache is not able to handle it with swappiness=0.  I will keep monitoring, but thank you very much for the analysis and info.
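
Since the improvement depends on these two settings actually being in effect, a small check against /proc/sys can be useful after reboots. The following is a rough sketch, not a tested tool: the function name sysctl_mismatches and its root parameter are hypothetical, and it only reads values without changing them.

```python
"""Compare live sysctl values with the ones listed above (sketch)."""
import os

# Values taken from this message.
DESIRED = {"vm.swappiness": "1", "vm.min_free_kbytes": "262144"}


def sysctl_mismatches(desired: dict, root: str = "/proc/sys") -> dict:
    """Return {name: (expected, actual)} for settings that differ.

    `actual` is None when the corresponding file cannot be read.
    """
    mismatches = {}
    for name, expected in desired.items():
        # sysctl names map to paths: vm.swappiness -> <root>/vm/swappiness
        path = os.path.join(root, *name.split("."))
        try:
            with open(path) as handle:
                actual = handle.read().strip()
        except OSError:
            actual = None
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches


if __name__ == "__main__":
    for name, (expected, actual) in sysctl_mismatches(DESIRED).items():
        print(f"{name}: expected {expected}, found {actual}")
```

The root parameter exists only so the check can be pointed at a scratch directory for testing instead of the live /proc/sys.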

Alex

Cheers,

Dave.

--

Dave Chinner

david@xxxxxxxxxxxxx
