On Mon, Aug 15, 2022 at 2:54 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> [cc Amir, the 5.10 stable XFS maintainer]
>
> On Tue, Aug 09, 2022 at 11:46:23AM +0000, bugzilla-daemon@xxxxxxxxxx wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=216343
> >
> >             Bug ID: 216343
> >            Summary: XFS: no space left in xlog cause system hang
> >            Product: File System
> >            Version: 2.5
> >     Kernel Version: 5.10.38
> >           Hardware: ARM
> >                 OS: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: XFS
> >           Assignee: filesystem_xfs@xxxxxxxxxxxxxxxxxxxxxx
> >           Reporter: zhoukete@xxxxxxx
> >         Regression: No
> >
> > Created attachment 301539
> >   --> https://bugzilla.kernel.org/attachment.cgi?id=301539&action=edit
> > stack
> >
> > 1. cannot login with ssh, system hanged and cannot do anything
> > 2. dmesg report 'audit: audit_backlog=41349 > audit_backlog_limit=8192'
> > 3. I send sysrq-crash and get vmcore file , I dont know how to reproduce it.
> >
> > Follwing is my analysis from vmcore:
> >
> > The reason why tty cannot login is pid 2021571 hold the acct_process mutex, and
> > 2021571 cannot release mutex because it is wait for xlog release space. See the
> > stac info in the attachment of stack.txt
> >
> > So I try to figure out what happened to xlog
> >
> > crash> struct xfs_ail.ail_target_prev,ail_targe,ail_head 0xffff00ff884f1000
> >   ail_target_prev = 0xe9200058600
> >   ail_target = 0xe9200058600
> >   ail_head = {
> >     next = 0xffff0340999a0a80,
> >     prev = 0xffff020013c66b40
> >   }
> >
> > there are 112 log item in ail list
> > crash> list 0xffff0340999a0a80 | wc -l
> > 112
> >
> > 79 item of them are xlog_inode_item
> > 30 item of them are xlog_buf_item
> >
> > crash> xfs_log_item.li_flags,li_lsn 0xffff0340999a0a80 -x
> >   li_flags = 0x1
> >   li_lsn = 0xe910005cc00  ===> first item lsn
> >
> > crash> xfs_log_item.li_flags,li_lsn ffff020013c66b40 -x
> >   li_flags = 0x1
> >   li_lsn = 0xe9200058600  ===> last item lsn
> >
> > crash>xfs_log_item.li_buf 0xffff0340999a0a80
> >   li_buf = 0xffff0200125b7180
> >
> > crash> xfs_buf.b_flags 0xffff0200125b7180 -x
> >   b_flags = 0x110032 (XBF_WRITE|XBF_ASYNC|XBF_DONE|_XBF_INODES|_XBF_PAGES)
> >
> > crash> xfs_buf.b_state 0xffff0200125b7180 -x
> >   b_state = 0x2 (XFS_BSTATE_IN_FLIGHT)
> >
> > crash> xfs_buf.b_last_error,b_retries,b_first_retry_time 0xffff0200125b7180 -x
> >   b_last_error = 0x0
> >   b_retries = 0x0
> >   b_first_retry_time = 0x0
> >
> > The buf flags show the io had been done(XBF_DONE is set).
> > When I review the code xfs_buf_ioend, if XBF_DONE is set, xfs_buf_inode_iodone
> > will be called and it will remove the log item from ail list, then release the
> > xlog space by moving the tail_lsn.
> >
> > But now this item is still in the ail list, and the b_last_error = 0, XBF_WRITE
> > is set.
> >
> > xfs buf log item is the same as the inode log item.
> >
> > crash> list -s xfs_log_item.li_buf 0xffff0340999a0a80
> > ffff033f8d7c9de8
> >   li_buf = 0x0
> > crash> xfs_buf_log_item.bli_buf ffff033f8d7c9de8
> >   bli_buf = 0xffff0200125b4a80
> > crash> xfs_buf.b_flags 0xffff0200125b4a80 -x
> >   b_flags = 0x100032 (XBF_WRITE|XBF_ASYNC|XBF_DONE|_XBF_PAGES)
> >
> > I think it is impossible that (XBF_DONE is set & b_last_error = 0) and the item
> > still in the ail.
> >
> > Is my analysis correct?

I don't think so.
I think this buffer write is in-flight.

> > Why xlog space cannot release space?

Not sure if space cannot be released or just takes a lot of time.
There are several AIL/CIL improvements in upstream kernel and
none of them are going to land in 5.10.y.
The reported kernel version 5.10.38 has almost no upstream fixes
at all, but I don't think that any of the fixes in 5.10.y are relevant for
this case anyway.

If this hang happens often with your workload, I suggest using
a newer kernel and/or formatting xfs with a larger log to meet
the demands of your workload.

Thanks,
Amir.
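
For anyone following the crash output quoted above, here is a minimal sketch of how the raw values can be decoded by hand. It assumes the usual XFS LSN layout (cycle number in the upper 32 bits, log block number in the lower 32 bits) and uses flag bit positions that I believe match fs/xfs/xfs_buf.h in 5.10; they do reproduce the decodes crash printed, but please check them against your own tree. The helper names split_lsn and decode_flags are made up for illustration.

```python
# Assumed XFS LSN layout: upper 32 bits = cycle, lower 32 bits = block.
def split_lsn(lsn):
    """Split a 64-bit LSN into (cycle, block)."""
    return lsn >> 32, lsn & 0xFFFFFFFF


# Assumed bit values for the flags named in the report (verify against
# fs/xfs/xfs_buf.h for the kernel in question).
XBF_FLAGS = {
    "XBF_WRITE": 1 << 1,
    "XBF_ASYNC": 1 << 4,
    "XBF_DONE": 1 << 5,
    "_XBF_INODES": 1 << 16,
    "_XBF_PAGES": 1 << 20,
}


def decode_flags(value):
    """Return the names of known flags set in value, plus any leftover bits."""
    names = [name for name, bit in XBF_FLAGS.items() if value & bit]
    leftover = value & ~sum(XBF_FLAGS.values())
    if leftover:
        names.append(hex(leftover))
    return names


if __name__ == "__main__":
    for label, lsn in [("first item", 0xE910005CC00),
                       ("last item / ail_target", 0xE9200058600)]:
        cycle, block = split_lsn(lsn)
        print(f"{label}: cycle={cycle:#x} block={block:#x}")
    for flags in (0x110032, 0x100032):
        print(f"{flags:#x}: {'|'.join(decode_flags(flags))}")
```

Read this way, the first AIL item sits in cycle 0xe91 and the last item (and the push target) in cycle 0xe92, i.e. the AIL spans a log wrap. Note that XBF_DONE only says the buffer contents are valid; it does not by itself say that the write has completed, which is why b_state (XFS_BSTATE_IN_FLIGHT) is the more telling field here.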