On 8/10/17 11:14, Damien Le Moal wrote: > Bart, > > On 8/10/17 00:24, Bart Van Assche wrote: >> On Wed, 2017-08-09 at 14:15 +0900, Damien Le Moal wrote: >>> Overall, yes, that is possible. However, considering only write commands >>> to a single zone, since at any time only one command is dequeued (the >>> one holding the zone lock), having the "lock+dequeue" and >>> "unlock+requeue" both atomic prevent any reordering of write commands >>> directed to that zone. The first command in the queue for the zone will >>> either (1) get the zone lock and be dequeued OR (2) stay in the queue. >>> In case (1), the next write for the zone will hit case (2) until the >>> dispatch command completes. If the dispatch command is requeued (at the >>> dispatch queue head), we hit back again the (1) or (2) cases, without >>> any change in the order. Isnt't it ? >> >> Hello Damien, >> >> Commands that get requeued are not reinserted at their original position >> in the request queue but usually at a different position. This is what makes >> requeueing cause request reordering. Anyway, can you check whether replacing >> patch "sd_zbc: Write unlock zone from sd_uninit_cmnd()" with the (untested) >> patch below does not trigger any lockups and prevents request reordering? >> Since that patch keeps the zone lock across requeuing attempts it should >> prevent request reordering. > > Your patch will indeed prevent deadlock for the case where the write > command that acquired the lock gets re-queued before running/completing. > If after re-queueing the command end ups being *before* any following > sequential write to the same zone, it will be fine as the flag indicates > that the zone lock is already acquired. So the command prep will not > cause deadlocking. > > However, the failure pattern I have seen most of the time is a write > command never tried before ending up at the head of the dispatch queue, > trying to acquire the zone lock and failing to do so because the lock > owning command is behind in the queue. > > Ex: The dispatch queue has sequential write commands A at the head and > write B following. > (1) A is dequeued and prep-ed, the zone lock is acquired. > (2) For whatever reason, A gets requeued, still holding the zone lock. > (3) Between (1) and (2), another context tries to push command B and > dequeues it. Zone lock fails and (B) goes for requeue (using a > workqueue, so different context) > (4) If the requeue work for A runs before the one for B (instead of both > A and B ending up in the same work), then B ends up at the head of the > queue. > (5) From here, B fails systematically. Requests behind B are not tried > since the device is marked busy. So B stays at the head and the failure > persists. Correction: At step (3), since command B is not issued to the HBA, it goes for requeue in the same context as scsi_queue_rq() call. At step (4), since A was issued, requeue is done with the requeue list workqueue. But A can still end up behind B because initially, B is only requeued at the head of the dispatch local list, which is itself spliced into the dispatch queue after that. A requeue may still be done before B goes back in the dispatch queue. > > The patch I sent is not about solving the ordering problem. Never was. > It just avoids all possible deadlock situations. > > There are more reordering situations possible higher up in the stack. > Ex: without a scheduler, blk-mq simply moves all submitted requests from > the CPU context queues into the hctx dispatch queue, in CPU order. This > means that a thread cleanly writing sequentially to a zone but moved > from one CPU to another half-way through the sequence can see its write > reordered in the dispatch queue. Same for the actual dispatch code > calling scsi_queue_rq(). This code loops over a local list, not the hctx > queue. So this code running from 2 different contexts with the second > one starting half way through our well behaving thread sequence will end > up splitting the sequence into 2 different local lists and so loosing > the sequential ordering. > > The no-scheduler case is a little special but needs handling. As for the > blk-mq dispatch code, we need that serialized/single threaded for zoned > block devices. I am inclined to say that this should be the case for any > HDD anyway as performance can only be better. Ming's series already > brings improvements, but no ordering guarantees for zoned devices. > > I am currently trying different approaches for this. In the mean time, I > would like to see the unlock change patch be applied to fix the deadlock > problem. > > Best. > >> >> Thanks, >> >> Bart. >> >> --- >> drivers/scsi/sd_zbc.c | 8 ++++++++ >> include/scsi/scsi_cmnd.h | 3 +++ >> 2 files changed, 11 insertions(+) >> >> diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c >> index 96855df9f49d..6d18b1bd3b26 100644 >> --- a/drivers/scsi/sd_zbc.c >> +++ b/drivers/scsi/sd_zbc.c >> @@ -290,10 +290,15 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd) >> * threads running on different CPUs write to the same >> * zone (with a synchronized sequential pattern). >> */ >> + if (cmd->flags & SCMD_IN_PROGRESS) >> + return BLKPREP_OK; >> + >> if (sdkp->zones_wlock && >> test_and_set_bit(zno, sdkp->zones_wlock)) >> return BLKPREP_DEFER; >> >> + cmd->flags |= SCMD_IN_PROGRESS; >> + >> return BLKPREP_OK; >> } >> >> @@ -302,6 +307,9 @@ void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd) >> struct request *rq = cmd->request; >> struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); >> >> + WARN_ON_ONCE(!(cmd->flags & SCMD_IN_PROGRESS)); >> + cmd->flags &= ~SCMD_IN_PROGRESS; >> + >> if (sdkp->zones_wlock) { >> unsigned int zno = sd_zbc_zone_no(sdkp, blk_rq_pos(rq)); >> WARN_ON_ONCE(!test_bit(zno, sdkp->zones_wlock)); >> diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h >> index a1266d318c85..f18c812094b5 100644 >> --- a/include/scsi/scsi_cmnd.h >> +++ b/include/scsi/scsi_cmnd.h >> @@ -57,6 +57,9 @@ struct scsi_pointer { >> /* for scmd->flags */ >> #define SCMD_TAGGED (1 << 0) >> #define SCMD_UNCHECKED_ISA_DMA (1 << 1) >> +#ifdef CONFIG_BLK_DEV_ZONED >> +#define SCMD_IN_PROGRESS (1 << 2) >> +#endif >> >> struct scsi_cmnd { >> struct scsi_request req; >> > -- Damien Le Moal, Ph.D. Sr Manager, System Software Group, Western Digital Research Damien.LeMoal@xxxxxxx Tel: (+81) 0466-98-3593 (Ext. 51-3593) 1 kirihara-cho, Fujisawa, Kanagawa, 252-0888 Japan www.wdc.com, www.hgst.com