Hi Brian,

Thanks very much for the response. Unfortunately we don't have logs going
back that far, so all I can say at the moment is that we're not seeing any
'metadata I/O error' lines in the logs we do have covering the period
whilst the problem has been occurring. We're going to recreate the affected
VM and see if the problem recurs - if it does then we'll be sure to grab
the logs immediately and check.

What we can say is that this problem seems to have recurred 3 times
already, each time on fresh VMs and disks. We initially wondered if it
could be due to a bad EBS volume or something similar, but that seems less
likely given the recurrence.

As for the other possible cause you mentioned - an I/O that never
completes - is it possible that excessive load could cause this, or would
it be more indicative of a concurrency issue at the filesystem / kernel
level?

One quirk of the workload on this machine is that we have a lot of XFS
project quotas which we're frequently checking to report disk usage -
roughly the kind of loop sketched below. Could it be that we're causing a
starvation problem?
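To give a concrete picture, the reporting we run amounts to roughly the
following - a minimal sketch only, where the mount point, the interval and
running it as a single loop are all illustrative rather than our exact
setup:

    #!/bin/sh
    # Periodically report project quota usage (in blocks) for the XFS
    # filesystem that backs the container root filesystems (loop0).
    MOUNTPOINT=/var/lib/containers/xfs   # illustrative path
    while true; do
        xfs_quota -x -c 'report -p -b' "$MOUNTPOINT"
        sleep 5                          # illustrative interval
    done

In practice there may be several of these reports in flight at once,
alongside the containers' own I/O - hence the starvation question.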
Thanks again,
Gareth

On Wed, Apr 26, 2017 at 9:34 PM Brian Foster <bfoster@xxxxxxxxxx> wrote:
>
> On Wed, Apr 26, 2017 at 05:47:15PM +0100, Gareth Clay wrote:
> > Hi,
> >
> > We're trying to diagnose a problem on an AWS virtual machine with two
> > XFS filesystems, each on loop devices. The loop files are sitting on
> > an EXT4 filesystem on Amazon EBS. The VM is running lots of Linux
> > containers - we're using Overlay FS on XFS to provide the root
> > filesystems for these containers.
> >
> > The problem we're seeing is a lot of processes entering D state, stuck
> > in the xlog_grant_head_wait function. We're also seeing xfsaild/loop0
> > stuck in D state. We're not able to write to the filesystem at all on
> > this device, it seems, without the process hitting D state. Once the
> > processes enter D state they never recover, and the list of D state
> > processes seems to be growing slowly over time.
> >
> > The filesystem on loop1 seems fine (we can run ls, touch etc)
> >
> > Would anyone be able to help us to diagnose the underlying problem please?
> >
> > Following the problem reporting FAQ we've collected the following
> > details from the VM:
> >
> > uname -a:
> > Linux 8dd9526f-00ba-4f7b-aa59-a62ec661c060 4.4.0-72-generic
> > #93~14.04.1-Ubuntu SMP Fri Mar 31 15:05:15 UTC 2017 x86_64 x86_64
> > x86_64 GNU/Linux
> >
> > xfs_repair version 3.1.9
> >
> > AWS VM with 8 CPU cores and EBS storage
> >
> > And we've also collected output from /proc, xfs_info, dmesg and the
> > XFS trace tool in the following files:
> >
> > https://s3.amazonaws.com/grootfs-logs/dmesg
> > https://s3.amazonaws.com/grootfs-logs/meminfo
> > https://s3.amazonaws.com/grootfs-logs/mounts
> > https://s3.amazonaws.com/grootfs-logs/partitions
> > https://s3.amazonaws.com/grootfs-logs/trace_report.txt
> > https://s3.amazonaws.com/grootfs-logs/xfs_info
> >
>
> It looks like everything is pretty much backed up on the log and the
> tail of the log is pinned by some dquot items. The trace output shows
> that xfsaild is spinning on flush locked dquots:
>
> <...>-2737622 [001] 33449671.892834: xfs_ail_flushing: dev 7:0 lip 0x0xffff88012e655e30 lsn 191/61681 type XFS_LI_DQUOT flags IN_AIL
> <...>-2737622 [001] 33449671.892868: xfs_ail_flushing: dev 7:0 lip 0x0xffff8800110d7bb0 lsn 191/61681 type XFS_LI_DQUOT flags IN_AIL
> <...>-2737622 [001] 33449671.892869: xfs_ail_flushing: dev 7:0 lip 0x0xffff88012e655a80 lsn 191/67083 type XFS_LI_DQUOT flags IN_AIL
> <...>-2737622 [001] 33449671.892869: xfs_ail_flushing: dev 7:0 lip 0x0xffff8800110d4810 lsn 191/67296 type XFS_LI_DQUOT flags IN_AIL
> <...>-2737622 [001] 33449671.892869: xfs_ail_flushing: dev 7:0 lip 0x0xffff880122210460 lsn 191/67310 type XFS_LI_DQUOT flags IN_AIL
>
> The cause of that is not immediately clear. One possible reason is it
> could be due to I/O failure. Do you have any I/O error messages (i.e.,
> "metadata I/O error: block ...") in your logs from before you ended up
> in this state?
>
> If not, I'm wondering if another possibility is an I/O that just never
> completes.. is this something you can reliably reproduce?
>
> Brian
>
> > Thanks for any help or advice you can offer!
> >
> > Claudia and Gareth
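P.S. When we next hit this we'll grab more detail on the stuck processes
straight away, roughly along these lines (a sketch only - the PID is
illustrative, and reading /proc/<pid>/stack needs root):

    # List processes in uninterruptible sleep (D state) with their wait channel
    ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

    # Kernel stack of one stuck writer, to confirm it's sitting in
    # xlog_grant_head_wait
    cat /proc/12345/stack

    # Or dump all blocked tasks into dmesg in one go
    echo w > /proc/sysrq-trigger

We'll also re-capture the XFS trace events at the same time (e.g. via
trace-cmd record -e xfs) and report back.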