On Tue, Sep 12, 2017 at 05:24:02PM +0900, Damien Le Moal wrote: > Ming, > > On 9/10/17 14:10, Ming Lei wrote: > > On Fri, Sep 08, 2017 at 09:53:53AM -0700, Damien Le Moal wrote: > >> Ming, > >> > >> On 9/8/17 05:43, Ming Lei wrote: > >>> Hi Damien, > >>> > >>> On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote: > >>>> In the case of a ZBC disk used with scsi-mq, zone write locking does > >>>> not prevent write reordering in sequential zones. Unlike the legacy > >>>> case, zone locking can only be done after the command request is > >>>> removed from the scheduler dispatch queue. That is, at the time of > >>>> zone locking, the write command may already be out of order. > >>> > >>> Per my understanding, for legacy case, it can be quite tricky to let > >>> the existed I/O scheduler guarantee the write order for ZBC disk. > >>> I guess requeue still might cause write reorder even in legacy path, > >>> since requeue can happen in both scsi_request_fn() and scsi_io_completion() > >>> with q->queue_lock released, meantime new rq belonging to the same > >>> zone can come and be inserted to queue. > >> > >> Yes, the write ordering will always depend on the scheduler doing the > >> right thing. But both cfq, deadline and even noop do the right thing > >> there, even considering the aging case. The next write for a zone will > >> always be the oldest in the queue for that zone, if it is not, it means > >> that the application did not write sequentially. Extensive testing in > >> the legacy case never showed a problem due to the scheduler itself. > > > > OK, I suggest to document this guarantee of no write reorder for ZBC > > somewhere, so that people will keep it in mind when trying to change > > the current code. > > Have you looked at the comments in sd_zbc.c ? That is explained there. > Granted, this is a little deep in the stack, but this is after all > dependent on the implementation of scsi_request_fn(). I can add comments > there too if you prefer. Yeah, I looked at that, but seems it is too coarse. > > >> scsi_requeue_command() does the unprep (zone unlock) and requeue while > >> holding the queue lock. So this is atomic with new write command > >> insertion. Requeued commands are added to the dispatch queue head, and > >> since a zone will only have a single write in-flight, there is no > >> reordering possible. The next write command for a zone to go again is > >> the last requeued one or the next in lba order. It works. > > > > One special case is write with FLUSH/FUA, which may be added to > > front of q->queue_head directly. Suppose one write with FUA is > > just comes between requeue and run queue, write reorder may be > > triggered. > > Zoned disks are recent and all of them support FUA. This means that a > write with FUA will be processed like any other write request (if I read > the code in blk-flush.c correctly). Well, at least for the mq case, > which does a blk_mq_sched_insert_request(), while the direct call to blk_mq_sched_bypass_insert() can be called for flush requests too, since requests in flush sequence share one driver tag(rq->tag != -1), then the rq can be added to front of hctx->dispatch directly. -- Ming