On 11/22/24 2:51 AM, Bart Van Assche wrote:
>
> On 11/20/24 7:32 PM, Damien Le Moal wrote:
>> On 11/20/24 06:04, Bart Van Assche wrote:
>>> On 11/18/24 6:57 PM, Damien Le Moal wrote:
>>>> On 11/19/24 9:27 AM, Bart Van Assche wrote:
>>>>> If the queue is filled with unaligned writes then the following
>>>>> deadlock occurs:
>>>>>
>>>>> Call Trace:
>>>>> <TASK>
>>>>> __schedule+0x8cc/0x2190
>>>>> schedule+0xdd/0x2b0
>>>>> blk_queue_enter+0x2ce/0x4f0
>>>>> blk_mq_alloc_request+0x303/0x810
>>>>> scsi_execute_cmd+0x3f4/0x7b0
>>>>> sd_zbc_do_report_zones+0x19e/0x4c0
>>>>> sd_zbc_report_zones+0x304/0x920
>>>>> disk_zone_wplug_handle_error+0x237/0x920
>>>>> disk_zone_wplugs_work+0x17e/0x430
>>>>> process_one_work+0xdd0/0x1490
>>>>> worker_thread+0x5eb/0x1010
>>>>> kthread+0x2e5/0x3b0
>>>>> ret_from_fork+0x3a/0x80
>>>>> </TASK>
>>>>>
>>>>> Fix this deadlock by removing the disk->fops->report_zones() call and by
>>>>> deriving the write pointer information from successfully completed zoned
>>>>> writes.
>>>>>
>>>>> Signed-off-by: Bart Van Assche <bvanassche@xxxxxxx>
>>>>
>>>> Doesn't this need a Fixes tag and CC stable, and come earlier in the series (or
>>>> sent separately) ?
>>>
>>> I will add Fixes: and Cc: stable tags.
>>>
>>> I'm not sure how to move this patch earlier since it depends on the
>>> previous patch in this series ("blk-zoned: Only handle errors after
>>> pending zoned writes have completed"). Without that patch, it is not
>>> safe to use zwplug->wp_offset_compl in the error handler.
>>>
>>>> Overall, this patch seems wrong anyway as zone reset and zone finish may be
>>>> done between 2 writes, failing the next one but the recovery done here will use
>>>> the previous successful write end position as the wp, which is NOT correct as
>>>> reset or finish changed that...
>>>
>>> I will add support for the zone reset and zone finish commands in this
>>> patch.
>>>
>>>> And we also have the possibility of torn writes
>>>> (partial writes) with SAS SMR drives. So I really think that you cannot avoid
>>>> doing a report zone to recover errors.
>>>
>>> Thanks for having brought this up. This is something I was not aware of.
>>>
>>> disk_zone_wplug_handle_error() submits a new request to retrieve zone
>>> information while handling an error triggered by other requests. This
>>> can easily lead to a deadlock as the above call trace shows. How about
>>> introducing a queue flag for the "report zones" approach in
>>> disk_zone_wplug_handle_error() such that the "report zones" approach is
>>> only used for SAS SMR drives?
>>
>> Sure, but how would that solve the potential deadlock problem ? Also, I am not
>> entirely clear on how the deadlock can happen given that zone write plugs are
>> queueing/blocking BIOs, not requests. So even assuming you have a large number
>> of BIOs plugged in a zone write plug, the error handler work should still be
>> able to issue a request to do a report zones, no ? On which resource can the
>> deadlock happen ? Plugged BIOs do not yet use a tag, right ?
>>
>> What am I missing here ? Or is it maybe something that can happen with your
>> modifications because you changed the zone write plug behavior to allow for more
>> than one BIO at a time being unplugged and issued to the device ?
>>
>> Note that if you do have a test case for this triggering the deadlock, we
>> definitely need to solve this and ideally have a blktest case checking it.
>
> Hi Damien,
>
> The call trace mentioned above comes from the kernel log and was
> encountered while I was testing this patch series. A reproducer has
> already been shared - see also
> https://lore.kernel.org/linux-block/e840b66a-79f0-4169-9ab1-c475d9608e4d@xxxxxxx/.
> The lockup happened after the queue was filled up with requests and
> hence sd_zbc_report_zones() failed to allocate an additional request
> for the zone report.
>
> I'm wondering whether this lockup can also happen with the current
> upstream kernel by submitting multiple unaligned writes simultaneously
> where each write affects another zone and where the number of writes
> matches the queue depth. Let me check.

This indeed may be a possibility.

> Thanks,
>
> Bart.

-- 
Damien Le Moal
Western Digital Research
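
As an illustration of the recovery approach discussed in this thread
(deriving the write pointer from successfully completed zoned writes, plus
the zone reset and zone finish handling Bart said he would add), here is a
minimal C sketch. Only zwplug->wp_offset_compl is taken from the thread;
the helper name, the full_wp_offset parameter, and the hook point are
assumptions for illustration, not the actual patch.

/*
 * Hypothetical sketch, imagined as living in block/blk-zoned.c. Called
 * under zwplug->lock from the completion path of a zoned write, it keeps
 * track of the write pointer position implied by successfully completed
 * operations so that error recovery does not have to allocate a request
 * for a zone report.
 */
static void disk_zone_wplug_update_wp_compl(struct blk_zone_wplug *zwplug,
                                            struct bio *bio,
                                            sector_t full_wp_offset)
{
        lockdep_assert_held(&zwplug->lock);

        switch (bio_op(bio)) {
        case REQ_OP_WRITE:
        case REQ_OP_WRITE_ZEROES:
        case REQ_OP_ZONE_APPEND:
                /* A successfully completed write advances the write pointer. */
                zwplug->wp_offset_compl += bio_sectors(bio);
                break;
        case REQ_OP_ZONE_RESET:
                /* A zone reset moves the write pointer back to the zone start. */
                zwplug->wp_offset_compl = 0;
                break;
        case REQ_OP_ZONE_FINISH:
                /*
                 * A zone finish makes the zone full; full_wp_offset is the
                 * offset used for a full zone (zone size or capacity,
                 * depending on the convention of the real code).
                 */
                zwplug->wp_offset_compl = full_wp_offset;
                break;
        default:
                break;
        }
}

Such a helper would be invoked before disk_zone_wplug_handle_error()
consults wp_offset_compl, which is only safe once pending zoned writes have
completed, as the earlier patch in the series requires.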