Hi,

I've been chasing down an issue recently that results in the following soft
lockup warning in XFS writeback (iomap_ioend) completion:

watchdog: BUG: soft lockup - CPU#42 stuck for 208s! [kworker/42:0:52508]
...
CPU: 42 PID: 52508 Comm: kworker/42:0 Tainted: G S L 5.17.0-rc8 #5
Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.2.4 05/28/2021
Workqueue: xfs-conv/dm-0 xfs_end_io [xfs]
RIP: 0010:_raw_spin_unlock_irqrestore+0x1a/0x31
Code: 74 01 c3 0f 1f 44 00 00 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 f7 c6 00 02 00 00 74 01 fb bf 01 00 00 00 <e8> f1 6e 6a ff 65 8b 05 7a e3 1a 63 85 c0 74 01 c3 0f 1f 44 00 00
RSP: 0018:ff4a81e4a8a37ce8 EFLAGS: 00000206
RAX: 0000000000000001 RBX: ffffffff9de08b28 RCX: ff4a81e4a6cfbd00
RDX: ff4a81e4a6cfbd00 RSI: 0000000000000246 RDI: 0000000000000001
RBP: 0000000000000246 R08: ff4a81e4a6cfbd00 R09: 000000000002f8c0
R10: 00006dfe673f3ac5 R11: 00000000fa83b2da R12: 0000000000000fa8
R13: ffcb6e504b14e640 R14: 0000000000000000 R15: 0000000000001000
FS: 0000000000000000(0000) GS:ff1a26083fd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000888000 CR3: 0000000790f4a006 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 folio_wake_bit+0x8a/0x110
 folio_end_writeback+0x37/0x80
 iomap_finish_ioend+0xc6/0x330
 iomap_finish_ioends+0x93/0xd0
 xfs_end_ioend+0x5e/0x150 [xfs]
 xfs_end_io+0xaf/0xe0 [xfs]
 process_one_work+0x1c5/0x390
 ? process_one_work+0x390/0x390
 worker_thread+0x30/0x350
 ? process_one_work+0x390/0x390
 kthread+0xe6/0x110
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30
 </TASK>

This is reproduced via repeated iterations of the LTP diotest3 test [1] on a
96-CPU system. It usually first occurs in a transient manner, dumping two or
three warnings before the system recovers and the test continues, but it
eventually reproduces a full-on livelock in this same path.

I think I've narrowed the root cause of the warning down to commit
c2407cf7d22d0 ("mm: make wait_on_page_writeback() wait for multiple pending
writebacks") (last discussed here [2] afaics).

The first thing to note is that I think the dio aspect of the test program is
not important. The test in this case creates 100 tasks that each run a
"buffered write -> fsync -> dio read" loop on an isolated range of the test
file, and the buffered writes and fsyncs seem to be the primary contributing
factors.

What seems to happen is that the majority of the fsync calls end up waiting
on writeback of a particular page. The wakeup of the writeback bit on that
page wakes a task that immediately sets PG_writeback on the page again, such
that the N other folio_wait_writeback() waiters see the bit still set and
immediately place themselves back onto the tail of the wait queue. Meanwhile,
the waker task spins in the WQ_FLAG_BOOKMARK loop in folio_wake_bit()
(backing off the lock for a cycle or so in each iteration) only to find the
same bunch of tasks still in the queue. This repeats for long enough to
trigger the soft lockup warning. I've confirmed the spinning behavior with a
tracepoint in the bookmark loop that shows we're stuck in it for at least
many hundreds of thousands of iterations by the time the soft lockup warning
triggers.
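For reference, here is roughly the code involved on each side of that
interaction, paraphrased/trimmed from 5.17 rather than quoted verbatim, so
treat the details as a sketch:

/*
 * mm/page-writeback.c (roughly): commit c2407cf7d22d0 turned the old
 * "if (writeback) wait once" into a loop, so a waiter that wakes up and
 * finds PG_writeback set again goes straight back onto the tail of the
 * wait queue.
 */
void folio_wait_writeback(struct folio *folio)
{
	while (folio_test_writeback(folio))
		folio_wait_bit(folio, PG_writeback);
}

/*
 * mm/filemap.c (roughly, trimmed): the waker walks the folio wait queue
 * in WAITQUEUE_WALK_BREAK_CNT-sized batches, dropping the lock between
 * batches and resuming from the bookmark until no bookmark remains. If
 * waiters re-add themselves at least as fast as the walk drains them,
 * this loop never terminates.
 */
static void folio_wake_bit(struct folio *folio, int bit_nr)
{
	wait_queue_head_t *q = folio_waitqueue(folio);
	struct wait_page_key key;
	wait_queue_entry_t bookmark;
	unsigned long flags;

	key.folio = folio;
	key.bit_nr = bit_nr;
	key.page_match = 0;

	bookmark.flags = 0;
	bookmark.private = NULL;
	bookmark.func = NULL;
	INIT_LIST_HEAD(&bookmark.entry);

	spin_lock_irqsave(&q->lock, flags);
	__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);

	while (bookmark.flags & WQ_FLAG_BOOKMARK) {
		/* back off the lock for a moment, then resume the walk */
		spin_unlock_irqrestore(&q->lock, flags);
		cpu_relax();
		spin_lock_irqsave(&q->lock, flags);
		__wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
	}
	...
}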
As to the full livelock variant, I think we end up with a write_cache_pages()
task that is woken waiting for page P, immediately sets PG_writeback on P
again and moves on to P+1. P+1 is also already under writeback, so
write_cache_pages() blocks there waiting on it. Page P has still not been
submitted for I/O at this point because XFS/iomap batches I/O submissions
across multiple ->writepage() callbacks. The waker task spinloop described
above occurs on page P, but since the waker task is the XFS workqueue task
responsible for ioend completion on the inode (these are delalloc writes that
require unwritten extent conversion on completion), it will never get around
to waking page P+1 and we're stuck for good. IOW:

1. Page P is stuck in PG_writeback on a queue awaiting I/O submission.
2. The task that would submit P is blocked waiting on PG_writeback of page
   P+1.
3. The waker task responsible for waking P+1 is spinning waking tasks for a
   previous PG_writeback state on page P.

The wait queue for P will never drain, however, because it consists of a
large number of tasks (> WAITQUEUE_WALK_BREAK_CNT) inserted via
folio_wait_writeback() that will never break out of the wait loop for P
(because of step 1).

I've run a few quick experiments to try and corroborate this analysis. The
problem goes away completely if I either back out the loop change in
folio_wait_writeback() or bump WAITQUEUE_WALK_BREAK_CNT to something like 128
(i.e. greater than the total possible number of waiter tasks in this test);
rough sketches of both hacks are appended below. I've also played a few games
with the bookmark behavior, mostly out of curiosity, but I usually end up
introducing other problems like missed wakeups, etc. I've not been able to
reproduce this problem on ext4, I suspect due to sufficiently different
writeback/completion batching behavior.

I was also thinking about whether skipping writeback of already-PG_writeback
pages after an explicit wait (i.e. removing the wait loop and the BUG_ON())
might avoid both this problem and the one the loop is intended to fix, but
I'm not sure that would be safe. Thoughts or ideas?

Brian

[1] while [ true ]; do TMPDIR=<xfsdir> diotest3 -b 65536 -n 100 -i 100 -o 1024000; done
[2] https://lore.kernel.org/linux-mm/000000000000886dbd05b7ffa8db@xxxxxxxxxx/
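For completeness, the two experimental hacks mentioned above were roughly the
following (sketched from memory, diagnostic only, not proposed as fixes):

/*
 * Experiment 1: back out the c2407cf7d22d0 loop in mm/page-writeback.c,
 * i.e. go back to a single wait. This reintroduces the multiple-pending-
 * writebacks race that commit was fixing, so it's only useful to show the
 * lockup depends on the requeue behavior.
 */
void folio_wait_writeback(struct folio *folio)
{
	if (folio_test_writeback(folio))
		folio_wait_bit(folio, PG_writeback);
}

/*
 * Experiment 2: in kernel/sched/wait.c, bump the bookmark batch size above
 * the number of waiter tasks created by the test (100):
 */
#define WAITQUEUE_WALK_BREAK_CNT	128	/* default is 64, iirc */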