Re: xfs/311 test pass but leave block device unusable

Andrea Tomassetti <andrea.tomassetti@xxxxxxxx> · Fri, 21 Jan 2022 11:20:23 +0100

On Thu, Jan 20, 2022 at 7:02 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> [cc xfs list]
>
> On Thu, Jan 20, 2022 at 12:04:47PM +0100, Andrea Tomassetti wrote:
> > Hi all,
> > I was using the (x)fstest utility on the Kernel 5.11 to try to
> > reproduce some xfs-related issues:
> >   INFO: task xfs-conv/dm-3:1360 blocked for more than 120 seconds.
> >   Workqueue: xfs-conv/dm-3 xfs_end_io [xfs]
> >    Call Trace:
> >     __schedule+0x44c/0x8a0
> >     schedule+0x4f/0xc0
> >     xlog_grant_head_wait+0xb5/0x1a0 [xfs]
> >     xlog_grant_head_check+0xe1/0x100 [xfs]
>
> Threads are stuck waiting for log space; can you post the full dmesg?
> And the xfs_info output of the test device?
>

I think I maybe didn't expose correctly the problems. I mentioned two
separated issues:
  1. The thread stuck problem, to which the dmesg refers to. It is
happening to us in production and we're trying to replicate it in a
test environment but without any success at the moment.
  2. While I was using xfstests, in the hope it could replicate the
problem, I stepped into the buggy behaviour of the test n. 311. And I
wanted to report this to you (and maybe get some advice on the
production's problem that is affecting us)

I hope I made myself clear, sorry for the misunderstanding.

Here the full dmesg, about the production's issue:

  [Thu Dec 23 17:13:56 2021] INFO: task xfs-conv/dm-3:1360 blocked for
more than 120 seconds.
  [Thu Dec 23 17:13:56 2021]       Not tainted 5.11.0-1020-aws
#21~20.04.2-Ubuntu
  [Thu Dec 23 17:13:56 2021] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [Thu Dec 23 17:13:56 2021] task:xfs-conv/dm-3   state:D stack:    0
pid: 1360 ppid:     2 flags:0x00004000
  [Thu Dec 23 17:13:56 2021] Workqueue: xfs-conv/dm-3 xfs_end_io [xfs]
  [Thu Dec 23 17:13:56 2021] Call Trace:
  [Thu Dec 23 17:13:56 2021]  __schedule+0x44c/0x8a0
  [Thu Dec 23 17:13:56 2021]  schedule+0x4f/0xc0
  [Thu Dec 23 17:13:56 2021]  xlog_grant_head_wait+0xb5/0x1a0 [xfs]
  [Thu Dec 23 17:13:56 2021]  xlog_grant_head_check+0xe1/0x100 [xfs]
  [Thu Dec 23 17:13:56 2021]  xfs_log_reserve+0xc2/0x1c0 [xfs]
  [Thu Dec 23 17:13:56 2021]  xfs_trans_reserve+0x1ca/0x210 [xfs]
  [Thu Dec 23 17:13:56 2021]  xfs_trans_alloc+0xd7/0x190 [xfs]
  [Thu Dec 23 17:13:56 2021]  xfs_iomap_write_unwritten+0x125/0x2e0 [xfs]
  [Thu Dec 23 17:13:56 2021]  xfs_end_ioend+0xdb/0x120 [xfs]
  [Thu Dec 23 17:13:56 2021]  xfs_end_io+0xb7/0xe0 [xfs]
  [Thu Dec 23 17:13:56 2021]  process_one_work+0x220/0x3c0
  [Thu Dec 23 17:13:56 2021]  rescuer_thread+0x2ca/0x3b0
  [Thu Dec 23 17:13:56 2021]  ? worker_thread+0x3f0/0x3f0
  [Thu Dec 23 17:13:56 2021]  kthread+0x12b/0x150
  [Thu Dec 23 17:13:56 2021]  ? set_kthread_struct+0x40/0x40
  [Thu Dec 23 17:13:56 2021]  ret_from_fork+0x22/0x30

> > When I realized that xfs test n. 311 was passing correctly but every
> > further attempt to use the block device (e.g. mount it) was failing.
> > The issue is reproducible after reboot.
> >
> > Test passed:
> >   ./check xfs/311
> >   FSTYP         -- xfs (non-debug)
> >   PLATFORM      -- Linux/x86_64 test 5.11.0-1021-aws
> > #22~20.04.2-Ubuntu SMP Wed Oct 27 21:27:13 UTC 2021
> >   MKFS_OPTIONS  -- -f /dev/xvdz
> >   MOUNT_OPTIONS -- /dev/xvdz /home/test/z
> >
> >   xfs/311 25s ...  25s
> >   Ran: xfs/311
> >   Passed all 1 tests
> >
> > Fail:
> >   # mount /dev/xvdz /home/test/z/
> >     mount: /home/test/z: /dev/xvdz already mounted or mount point busy.
> >     [ 2222.028417] /dev/xvdz: Can't open blockdev
> >
> > lsof does not show anything that is using either /dev/xvdz or /home/test/z
> >
> > Any idea why is this happening?
>
> xfs-conv handles unwritten extent conversion after writeback, so I would
> speculate (without dmesg data) that everyone got wedged trying to start
> a transaction, and the log is blocked up for whatever reason.
>

I run the test xfs/311 with the Kernel v5.13 and the behaviour is the
same as with the 5.11: after running it the disk becomes unusable.

Trying to rerun the test, after the first successful run, leads to:

  FSTYP         -- xfs (non-debug)
  PLATFORM      -- Linux/x86_64 test 5.11.0-1021-aws
#22~20.04.2-Ubuntu SMP Wed Oct 27 21:27:13 UTC 2021
  MKFS_OPTIONS  -- -f /dev/xvdz
  MOUNT_OPTIONS -- /dev/xvdz /home/test/z

  our local _scratch_mkfs routine ...
  mkfs.xfs: cannot open /dev/xvdz: Device or resource busy
  check: failed to mkfs $SCRATCH_DEV using specified options
  Interrupted!
  Passed all 0 tests

with no errors in dmesg.

xfs_info /dev/xvdz
meta-data=/dev/xvdz              isize=512    agcount=4, agsize=32768000 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=131072000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=64000, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

> > The `xlog_grant_head_wait` race issue has been resolved in a later
> > Kernel version, am I right?
>
> Beats me.
>
> > Best regards,
> > Andrea
> >
> > --
> >
> >
> >
> >
> >
> >
> >
> > The contents of this email are confidential. If the reader of this
>
> Not anymore they aren't.
>
> --D
>
> > message is not the intended recipient, you are hereby notified that any
> > dissemination, distribution or copying of this communication is strictly
> > prohibited. If you have received this communication in error, please notify
> > us immediately by replying to this message and deleting it from your
> > computer. Thank you. Devo, Inc; arco@xxxxxxxx <mailto:arco@xxxxxxxx>;
> > Calle Estébanez Calderón 3-5, 5th Floor. Madrid, Spain 28020
> >

-- 

The contents of this email are confidential. If the reader of this 
message is not the intended recipient, you are hereby notified that any 
dissemination, distribution or copying of this communication is strictly 
prohibited. If you have received this communication in error, please notify 
us immediately by replying to this message and deleting it from your 
computer. Thank you. Devo, Inc; arco@xxxxxxxx <mailto:arco@xxxxxxxx>;  
Calle Estébanez Calderón 3-5, 5th Floor. Madrid, Spain 28020