On Thu, Jan 20, 2022 at 7:02 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> [cc xfs list]
>
> On Thu, Jan 20, 2022 at 12:04:47PM +0100, Andrea Tomassetti wrote:
> > Hi all,
> > I was using the xfstests utility on kernel 5.11 to try to
> > reproduce some xfs-related issues:
> >
> > INFO: task xfs-conv/dm-3:1360 blocked for more than 120 seconds.
> > Workqueue: xfs-conv/dm-3 xfs_end_io [xfs]
> > Call Trace:
> >  __schedule+0x44c/0x8a0
> >  schedule+0x4f/0xc0
> >  xlog_grant_head_wait+0xb5/0x1a0 [xfs]
> >  xlog_grant_head_check+0xe1/0x100 [xfs]
>
> Threads are stuck waiting for log space; can you post the full dmesg?
> And the xfs_info output of the test device?

I think I didn't explain the problems clearly. I mentioned two separate
issues:

1. The stuck-thread problem the dmesg refers to. It is happening to us
   in production and we're trying to replicate it in a test environment,
   but without any success so far.
2. While I was using xfstests, in the hope it could replicate the
   problem, I ran into the buggy behaviour of test xfs/311, and I wanted
   to report it to you (and maybe get some advice on the production
   problem that is affecting us).

I hope I made myself clear, sorry for the misunderstanding.

Here is the full dmesg for the production issue:

[Thu Dec 23 17:13:56 2021] INFO: task xfs-conv/dm-3:1360 blocked for more than 120 seconds.
[Thu Dec 23 17:13:56 2021]       Not tainted 5.11.0-1020-aws #21~20.04.2-Ubuntu
[Thu Dec 23 17:13:56 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Dec 23 17:13:56 2021] task:xfs-conv/dm-3   state:D stack:    0 pid: 1360 ppid:     2 flags:0x00004000
[Thu Dec 23 17:13:56 2021] Workqueue: xfs-conv/dm-3 xfs_end_io [xfs]
[Thu Dec 23 17:13:56 2021] Call Trace:
[Thu Dec 23 17:13:56 2021]  __schedule+0x44c/0x8a0
[Thu Dec 23 17:13:56 2021]  schedule+0x4f/0xc0
[Thu Dec 23 17:13:56 2021]  xlog_grant_head_wait+0xb5/0x1a0 [xfs]
[Thu Dec 23 17:13:56 2021]  xlog_grant_head_check+0xe1/0x100 [xfs]
[Thu Dec 23 17:13:56 2021]  xfs_log_reserve+0xc2/0x1c0 [xfs]
[Thu Dec 23 17:13:56 2021]  xfs_trans_reserve+0x1ca/0x210 [xfs]
[Thu Dec 23 17:13:56 2021]  xfs_trans_alloc+0xd7/0x190 [xfs]
[Thu Dec 23 17:13:56 2021]  xfs_iomap_write_unwritten+0x125/0x2e0 [xfs]
[Thu Dec 23 17:13:56 2021]  xfs_end_ioend+0xdb/0x120 [xfs]
[Thu Dec 23 17:13:56 2021]  xfs_end_io+0xb7/0xe0 [xfs]
[Thu Dec 23 17:13:56 2021]  process_one_work+0x220/0x3c0
[Thu Dec 23 17:13:56 2021]  rescuer_thread+0x2ca/0x3b0
[Thu Dec 23 17:13:56 2021]  ? worker_thread+0x3f0/0x3f0
[Thu Dec 23 17:13:56 2021]  kthread+0x12b/0x150
[Thu Dec 23 17:13:56 2021]  ? set_kthread_struct+0x40/0x40
[Thu Dec 23 17:13:56 2021]  ret_from_fork+0x22/0x30

> > I then realized that xfs test n. 311 was passing correctly, but every
> > further attempt to use the block device (e.g. mounting it) was failing.
> > The issue is reproducible after a reboot.
> >
> > Test passed:
> > ./check xfs/311
> > FSTYP         -- xfs (non-debug)
> > PLATFORM      -- Linux/x86_64 test 5.11.0-1021-aws #22~20.04.2-Ubuntu SMP Wed Oct 27 21:27:13 UTC 2021
> > MKFS_OPTIONS  -- -f /dev/xvdz
> > MOUNT_OPTIONS -- /dev/xvdz /home/test/z
> >
> > xfs/311 25s ... 25s
> > Ran: xfs/311
> > Passed all 1 tests
> >
> > Fail:
> > # mount /dev/xvdz /home/test/z/
> > mount: /home/test/z: /dev/xvdz already mounted or mount point busy.
> > [ 2222.028417] /dev/xvdz: Can't open blockdev
> >
> > lsof does not show anything that is using either /dev/xvdz or /home/test/z
> >
> > Any idea why this is happening?
>
> xfs-conv handles unwritten extent conversion after writeback, so I would
> speculate (without dmesg data) that everyone got wedged trying to start
> a transaction, and the log is blocked up for whatever reason.
>
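In case it helps to confirm that theory, here is roughly how I plan to
catch the stall on the production box the next time it happens. Just a
sketch: the sysrq part is generic, but the tracepoint names are my
assumption from reading the 5.11 sources, so I'll verify them with
'ls events/xfs' first:

  # Dump stacks of all blocked (state D) tasks while the hang is live:
  echo w > /proc/sysrq-trigger

  # Watch tasks sleeping on / waking from the log grant heads:
  cd /sys/kernel/debug/tracing
  echo 1 > events/xfs/xfs_log_grant_sleep/enable
  echo 1 > events/xfs/xfs_log_grant_wake/enable
  cat trace_pipe

If I only ever see sleep events and no wakes, I would read that as the
log staying full rather than a lost wakeup, but I may be off base here.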
On the xfs/311 side: I ran the test with kernel v5.13 and the behaviour
is the same as with 5.11: after running it, the disk becomes unusable.
Re-running the test after the first successful run leads to:

FSTYP         -- xfs (non-debug)
PLATFORM      -- Linux/x86_64 test 5.11.0-1021-aws #22~20.04.2-Ubuntu SMP Wed Oct 27 21:27:13 UTC 2021
MKFS_OPTIONS  -- -f /dev/xvdz
MOUNT_OPTIONS -- /dev/xvdz /home/test/z

our local _scratch_mkfs routine ...
mkfs.xfs: cannot open /dev/xvdz: Device or resource busy
check: failed to mkfs $SCRATCH_DEV using specified options
Interrupted!
Passed all 0 tests

with no errors in dmesg.

xfs_info /dev/xvdz
meta-data=/dev/xvdz              isize=512    agcount=4, agsize=32768000 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=131072000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=64000, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

> > The `xlog_grant_head_wait` race issue has been resolved in a later
> > kernel version, am I right?
>
> Beats me.
>
> > Best regards,
> > Andrea
> >
> > --
> >
> > The contents of this email are confidential. If the reader of this
>
> Not anymore they aren't.
>
> --D
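P.S. About the scratch device staying busy after xfs/311: lsof only sees
userspace file-table users, so it would miss a kernel-internal holder
such as a leftover device-mapper target, which would explain why it
showed nothing. I don't know yet whether xfs/311 stacks a dm target on
the scratch device (that part is my guess), but this is what I intend
to check next ('stale-target' below is a placeholder for whatever
dmsetup actually lists):

  # Anything stacked on top of the device?
  lsblk /dev/xvdz
  ls /sys/block/xvdz/holders/

  # Any leftover device-mapper targets?
  dmsetup ls
  dmsetup table

  # If a stale target shows up, removing it should free the device:
  dmsetup remove stale-target

If the holders directory is empty and dmsetup shows nothing, I'll look
for an exclusive opener elsewhere instead.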