Deadlock Issue with blk-wbt and raid5+journal

Hi Jens,

While testing md/raid5 with the journal option using loop devices, I've
found an easily reproducible hang on my system. Simply running an fio
write job with the md group_thread_cnt set to 4 hits it. Curiously,
it is not hit unless the journal is in use.

I'm running on the current md/md-next branch; however, I've seen this
bug for a couple of months now on recent kernels and have no idea how
long it has been present.

I end up seeing multiple hung tasks with the following stack trace:

  schedule+0x9e/0x140
  io_schedule+0x70/0xb0
  rq_qos_wait+0x153/0x210
  wbt_wait+0x127/0x1f0
  __rq_qos_throttle+0x38/0x60
  blk_mq_submit_bio+0x589/0xcd0
  __submit_bio+0xe6/0x100
  submit_bio_noacct_nocheck+0x42e/0x470
  submit_bio_noacct+0x4c2/0xbb0
  ops_run_io+0x46b/0x1a30
  handle_stripe+0xcd3/0x36c0
  handle_active_stripes.constprop.0+0x6f6/0xa60
  raid5_do_work+0x177/0x330
  process_one_work+0x609/0xb00
  worker_thread+0x2d4/0x710
  kthread+0x18c/0x1c0
  ret_from_fork+0x1f/0x30

When this happens, I find between one and roughly ten inflight IOs on
the WBT of the underlying loop devices, as seen in
'/sys/kernel/debug/block/loop[0-3]/rqos/wbt/inflight'.

I've done some debugging in this area and this is what I'm seeing:

A few IOs in the WBT are put to sleep when the inflight counter reaches
the limit (96 in my case), and a number of further IO tasks are put to
sleep once the limit is exceeded. So far that makes sense. I put some
tracing in wbt_rqw_done() and can see the inflight count drop back down
to a low number as other IOs complete, but then it stalls before
reaching zero. wbt_rqw_done() never wakes up any of the sleeping threads
because, for some reason, wb_recent_wait(rwb) always returns false and
thus the limit is always zero, so the conditional:

        if (inflight && inflight >= limit)
                return;

always gets hit, because inflight is always greater than the zero limit
(some of the inflight IOs are sleeping, waiting to be woken up). Thus
the sleeping tasks remain sleeping forever. I've also verified that
rwb_wake_all() never gets called in this scenario.
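
For reference, here is the relevant part of wbt_rqw_done() as I read it
in block/blk-wbt.c (paraphrased and trimmed to the limit selection;
exact details vary between kernel versions):

        /* Paraphrased from block/blk-wbt.c; trimmed, varies by version */
        static void wbt_rqw_done(struct rq_wb *rwb, struct rq_wait *rqw,
                                 enum wbt_flags wb_acct)
        {
                int inflight, limit;

                inflight = atomic_dec_return(&rqw->inflight);

                /* ... wbt-disabled check (calls rwb_wake_all()) elided ... */

                /*
                 * For writes on a device with write-back caching, the
                 * limit drops to 0 unless writeback has recently waited.
                 */
                if (wb_acct & WBT_DISCARD)
                        limit = rwb->wb_background;
                else if (rwb->wc && !wb_recent_wait(rwb))
                        limit = 0;
                else
                        limit = rwb->wb_normal;

                /*
                 * With limit == 0, any non-zero inflight count returns
                 * here, so the sleeping submitters are never woken.
                 */
                if (inflight && inflight >= limit)
                        return;

                /* ... wake-up of the rqw waitqueue follows below ... */
        }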

Given the conditions for hitting the bug, I fully expected this to be
an issue in the raid code, but unless I'm missing something, it sure
looks to me like a deadlock in the wbt code, which makes me wonder why
nobody else has hit it. Is there something I'm missing that is supposed
to be waking up these processes? Or is there something unusual about the
raid5+journal+loop path that causes wb_recent_wait() to always return
false?
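
For context, wb_recent_wait() itself is tiny; as I read it (again
paraphrased, and the exact bdi accessor differs between kernel
versions), it only returns true if a task slept in the dirty-page
throttling path for that backing device within the last second:

        /* Paraphrased from block/blk-wbt.c; bdi accessor varies by version */
        static bool wb_recent_wait(struct rq_wb *rwb)
        {
                struct bdi_writeback *wb = &rwb->rqos.q->disk->bdi->wb;

                /*
                 * dirty_sleep is stamped in balance_dirty_pages() when a
                 * dirtier is throttled, so this only returns true if
                 * something dirty-throttled against this bdi within the
                 * last HZ jiffies.
                 */
                return time_before(jiffies, wb->dirty_sleep + HZ);
        }

If nothing ever dirty-throttles against the loop devices' bdis (which
seems plausible when the writes come from the raid5 journal path rather
than directly from page-cache dirtiers), dirty_sleep would never advance
and wb_recent_wait() would stay false; but I haven't confirmed that's
what's happening.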

Any thoughts?

Thanks,

Logan
