On 11/20/24 7:32 PM, Damien Le Moal wrote:
On 11/20/24 06:04, Bart Van Assche wrote:
On 11/18/24 6:57 PM, Damien Le Moal wrote:
On 11/19/24 9:27 AM, Bart Van Assche wrote:
If the queue is filled with unaligned writes then the following
deadlock occurs:
Call Trace:
<TASK>
__schedule+0x8cc/0x2190
schedule+0xdd/0x2b0
blk_queue_enter+0x2ce/0x4f0
blk_mq_alloc_request+0x303/0x810
scsi_execute_cmd+0x3f4/0x7b0
sd_zbc_do_report_zones+0x19e/0x4c0
sd_zbc_report_zones+0x304/0x920
disk_zone_wplug_handle_error+0x237/0x920
disk_zone_wplugs_work+0x17e/0x430
process_one_work+0xdd0/0x1490
worker_thread+0x5eb/0x1010
kthread+0x2e5/0x3b0
ret_from_fork+0x3a/0x80
</TASK>
Fix this deadlock by removing the disk->fops->report_zones() call and by
deriving the write pointer information from successfully completed zoned
writes.
Signed-off-by: Bart Van Assche <bvanassche@xxxxxxx>
Doesn't this need a Fixes tag and a Cc: stable tag, and shouldn't it come earlier in the series (or be sent separately)?
I will add Fixes: and Cc: stable tags.
I'm not sure how to move this patch earlier since it depends on the
previous patch in this series ("blk-zoned: Only handle errors after
pending zoned writes have completed"). Without that patch, it is not
safe to use zwplug->wp_offset_compl in the error handler.
Overall, this patch seems wrong anyway: a zone reset or zone finish may be done
between two writes, failing the next write, but the recovery done here will use
the end position of the previous successful write as the wp, which is NOT
correct since the reset or finish changed that...
I will add support for the zone reset and zone finish commands in this
patch.
And we also have the possibility of torn writes
(partial writes) with SAS SMR drives. So I really think that you cannot avoid
doing a report zone to recover errors.
Thanks for bringing this up; this is something I was not aware of.
disk_zone_wplug_handle_error() submits a new request to retrieve zone
information while handling an error triggered by other requests. This
can easily lead to a deadlock as the above call trace shows. How about
introducing a queue flag for the "report zones" approach in
disk_zone_wplug_handle_error() such that the "report zones" approach is
only used for SAS SMR drives?
Sure, but how would that solve the potential deadlock problem? Also, I am not
entirely clear on how the deadlock can happen, given that zone write plugs
queue/block BIOs, not requests. So even assuming you have a large number
of BIOs plugged in a zone write plug, the error handler work should still be
able to issue a request to do a report zones, no? On which resource can the
deadlock happen? Plugged BIOs do not yet use a tag, right?
What am I missing here? Or is it maybe something that can happen with your
modifications, because you changed the zone write plug behavior to allow more
than one BIO at a time to be unplugged and issued to the device?
Note that if you do have a test case that triggers the deadlock, we
definitely need to solve this, and ideally have a blktest case checking it.
Hi Damien,
The call trace mentioned above comes from the kernel log and was
encountered while I was testing this patch series. A reproducer has
already been shared - see also
https://lore.kernel.org/linux-block/e840b66a-79f0-4169-9ab1-c475d9608e4d@xxxxxxx/.
The lockup happened after the queue was filled up
with requests, and hence sd_zbc_report_zones() failed to allocate an
additional request for the zone report.
I'm wondering whether this lockup can also happen with the current
upstream kernel by simultaneously submitting multiple unaligned writes,
where each write affects a different zone and the number of writes
matches the queue depth.
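To make that scenario concrete, something like the following untested sketch
could be a starting point (the device path, zone size, and dd flags are my
assumptions; a real blktest case would query the zone size via sysfs or
blkzone instead of hard-coding it):

```
#!/bin/bash
# Hypothetical reproducer sketch, untested on real hardware.
# Assumptions: /dev/sdX is a host-managed SMR disk whose zones are empty
# (write pointer at the zone start), and the zone size is 256 MiB.
dev=/dev/sdX
zone_size=$((256 * 1024 * 1024))
qd=$(cat /sys/block/sdX/queue/nr_requests)

# Issue one direct write per zone, 4 KiB past the zone start, so every
# write is unaligned with the write pointer. Run as many writes in
# parallel as there are request tags so that the queue fills up.
for ((i = 0; i < qd; i++)); do
    dd if=/dev/zero of="$dev" bs=4096 count=1 \
       oflag=direct,seek_bytes seek=$((i * zone_size + 4096)) &
done
wait
```

If all of these writes fail and trigger error handling at once, the
report-zones request would have no tag left to allocate.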
Thanks,
Bart.