Hi guys,

We've been using XFS recently on our build system because we found that it scales pretty well, and we make good use of the reflink feature :)

I think our setup is relatively unique in that on every one of our build servers we mount hundreds of XFS filesystems from NBD devices in parallel. Our build environments are stored on qcow2 images connected with qemu-nbd, and we umount them when the build is finished. Those qcow2 images are stored on an NFS mount, which leads to some (expected) hiccups when reading/writing blocks: sometimes the NBD layer will return errors to the block layer, which in turn passes them on to XFS. It could be due to network contention, very high load on the server, or any transient error really, and in those cases XFS will normally force shut down the filesystem and wait for a umount. All of this is fine and is exactly the behaviour we'd expect.

However, it turns out that we keep hitting what I think is a race condition between umount and a force shutdown from XFS itself, where I have a umount process completely stuck in xfs_ail_push_all_sync():

[<ffffffff813d987e>] xfs_ail_push_all_sync+0x9e/0xe0
[<ffffffff813c20c7>] xfs_unmountfs+0x67/0x150
[<ffffffff813c5540>] xfs_fs_put_super+0x20/0x70
[<ffffffff811cba7a>] generic_shutdown_super+0x6a/0xf0
[<ffffffff811cbb2b>] kill_block_super+0x2b/0x80
[<ffffffff811cc067>] deactivate_locked_super+0x47/0x80
[<ffffffff811ccc19>] deactivate_super+0x49/0x70
[<ffffffff811e7b3e>] cleanup_mnt+0x3e/0x90
[<ffffffff811e7bdd>] __cleanup_mnt+0xd/0x10
[<ffffffff810e1b39>] task_work_run+0x79/0xa0
[<ffffffff810c2df7>] exit_to_usermode_loop+0x4f/0x75
[<ffffffff8100134b>] syscall_return_slowpath+0x5b/0x70
[<ffffffff81a2cbe3>] entry_SYSCALL_64_fastpath+0x96/0x98
[<ffffffffffffffff>] 0xffffffffffffffff

This is on a v4.10.1 kernel. I've had a look at xfs_ail_push_all_sync() and I wonder if there isn't a potential lost-wakeup problem: I can't see where the condition is retested after the current process is set to TASK_UNINTERRUPTIBLE and before schedule() is called (though I know nothing about XFS internals...).

Here's an excerpt of the relevant dmesg messages, which very likely appeared around the same time the unmount was started:

[29961.767707] block nbd74: Other side returned error (22)
[29961.837518] XFS (nbd74): metadata I/O error: block 0x6471ba0 ("xfs_trans_read_buf_map") error 5 numblks 32
[29961.838172] block nbd74: Other side returned error (22)
[29961.838179] block nbd74: Other side returned error (22)
[29961.838184] block nbd74: Other side returned error (22)
[29961.838203] block nbd74: Other side returned error (22)
[29961.838208] block nbd74: Other side returned error (22)
[29962.259551] XFS (nbd74): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[29962.356376] XFS (nbd74): xfs_do_force_shutdown(0x8) called from line 3454 of file fs/xfs/xfs_inode.c.  Return address = 0xffffffff813bf471
[29962.503003] XFS (nbd74): Corruption of in-memory data detected.  Shutting down filesystem
[29963.166314] XFS (nbd74): Please umount the filesystem and rectify the problem(s)

I'm pretty sure the process isn't deadlocking on the spinlock, because it doesn't burn any CPU and is really out of the scheduler pool. It should also be noted that when I noticed the hung umount process, I manually tried to unmount the corresponding XFS mountpoint and that worked fine, though it obviously didn't "unhang" the stuck umount process.

Any help would be appreciated :)

Thanks,
Quentin
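
P.S. To clarify what I mean by a lost wakeup, here is a rough sketch of the wait-loop pattern I would have expected to see, with the condition retested between prepare_to_wait() and schedule(). This is illustrative only -- it is not the actual code from fs/xfs/xfs_trans_ail.c, and the names wait_for_ail_empty() and ail_is_empty() are made up for the example:

    #include <linux/sched.h>
    #include <linux/wait.h>

    /* Stand-in for "has the AIL drained?"; invented for this sketch. */
    extern bool ail_is_empty(void);

    /* Sketch of a lost-wakeup-safe wait loop, not the real XFS code. */
    static void wait_for_ail_empty(wait_queue_head_t *wq)
    {
            DEFINE_WAIT(wait);

            for (;;) {
                    /* The task state is set to TASK_UNINTERRUPTIBLE here... */
                    prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);

                    /*
                     * ... and the condition is retested afterwards.  If the
                     * waker made the condition true and called wake_up()
                     * before our prepare_to_wait(), this retest is what
                     * stops us from calling schedule() on a wakeup that has
                     * already come and gone.
                     */
                    if (ail_is_empty())
                            break;

                    schedule();
            }
            finish_wait(wq, &wait);
    }

If the actual code only tests the AIL state before setting the task state, and doesn't retest it before schedule(), then a wakeup racing with the force shutdown could be missed and the umount would sleep forever, which is what this looks like from the outside. Apologies in advance if I'm simply misreading the code.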