On 2019/07/09 22:51, Bart Van Assche wrote:
> On 7/9/19 2:02 AM, Damien Le Moal wrote:
>> Simultaneously writing to a sequential zone of a zoned block device
>> from multiple contexts requires mutual exclusion for BIO issuing to
>> ensure that writes happen sequentially. However, even for a well
>> behaved user correctly implementing such synchronization, BIO plugging
>> may interfere and result in BIOs from the different contexts being
>> reordered if plugging is done outside of the mutual exclusion section,
>> e.g. the plug was started by a function higher in the call chain than
>> the function issuing BIOs.
>>
>>       Context A                           Context B
>>
>>    | blk_start_plug()
>>    | ...
>>    | seq_write_zone()
>>      | mutex_lock(zone)
>>      | submit_bio(bio-0)
>>      | submit_bio(bio-1)
>>      | mutex_unlock(zone)
>>      | return
>>    | ------------------------------> | seq_write_zone()
>>                                        | mutex_lock(zone)
>>                                        | submit_bio(bio-2)
>>                                        | mutex_unlock(zone)
>>    | <------------------------------ |
>>    | blk_finish_plug()
>>
>> In the above example, despite the mutex synchronization resulting in the
>> correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
>> issued after BIO 2 when the plug is released with blk_finish_plug().
>>
>> To fix this problem, introduce the internal helper function
>> blk_mq_plug() to access the current context plug, return the current
>> plug only if the target device is not a zoned block device or if the
>> BIO to be plugged is not a write operation. Otherwise, ignore the plug
>> and return NULL, resulting in all writes to zoned block devices never
>> being plugged.
>
> Are there classes of zoned devices for which the plug list is useful? If
> so, have you considered any other approaches, e.g. one plug list per
> request queue instead of one plug list per task in case of zoned devices?

Plugging for writes to zoned block devices is not really useful at all.
The reason is that any user of the disk executing requests at a queue
depth larger than 1 must use mq-deadline to preserve write ordering.
With this scheduler, zone write locking prevents dispatching more than
one write request per zone at any time, so sequential writes for a zone
accumulate in the scheduler queue. This creates plenty of opportunities
for merging small (i.e. single page) write BIOs with preceding pending
requests, which is exactly the intent of plugging in the first place.

A per request queue plug list would work, but it would require a single
lock, going against blk-mq design principles. Such a method would also
result in a lot more changes for no real gain (for the reason explained
above). Performance-wise, simply disabling per-context plugging for
writes has no measurable impact and is, I think, far simpler.

Best regards.

--
Damien Le Moal
Western Digital Research
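
For illustration only, the reordering scenario and the plug-bypass rule
discussed in this thread can be sketched as a small userspace model. This is
not the kernel code: every model_* name, the structures and MAX_BIOS are
hypothetical stand-ins, and the only thing taken from the thread is the rule
"use the context plug unless the BIO is a write to a zoned block device".

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BIOS 8 /* hypothetical capacity for this model */

/* Hypothetical stand-in for a per-context plug list. */
struct model_plug {
	int pending[MAX_BIOS]; /* BIO ids held back by the plug */
	size_t count;
};

/* Hypothetical stand-in for a block device. */
struct model_device {
	int zoned;                /* nonzero for a zoned block device */
	int dispatched[MAX_BIOS]; /* BIO ids in the order the device sees them */
	size_t count;
};

/* Hypothetical stand-in for an issuing context (task). */
struct model_context {
	struct model_plug *plug; /* NULL when no plug is active */
};

/* The rule from the thread: plug unless this is a write to a zoned device. */
static struct model_plug *model_blk_mq_plug(const struct model_device *dev,
					    const struct model_context *ctx,
					    int is_write)
{
	if (!dev->zoned || !is_write)
		return ctx->plug;
	return NULL;
}

static void model_dispatch(struct model_device *dev, int bio_id)
{
	assert(dev->count < MAX_BIOS);
	dev->dispatched[dev->count++] = bio_id;
}

/* Submit a write: hold it in the plug if one applies, else dispatch now. */
static void model_submit_write(struct model_device *dev,
			       struct model_context *ctx, int bio_id)
{
	struct model_plug *plug = model_blk_mq_plug(dev, ctx, 1);

	if (plug) {
		assert(plug->count < MAX_BIOS);
		plug->pending[plug->count++] = bio_id;
		return;
	}
	model_dispatch(dev, bio_id);
}

/* Release the plug: everything it held reaches the device only now. */
static void model_finish_plug(struct model_device *dev,
			      struct model_context *ctx)
{
	struct model_plug *plug = ctx->plug;

	if (!plug)
		return;
	for (size_t i = 0; i < plug->count; i++)
		model_dispatch(dev, plug->pending[i]);
	plug->count = 0;
	ctx->plug = NULL;
}

/* Replay the diagram: A plugs and writes 0, 1; B writes 2; A unplugs. */
static size_t model_run_scenario(int zoned, int order[MAX_BIOS])
{
	struct model_device dev = { .zoned = zoned };
	struct model_plug plug_a = { .count = 0 };
	struct model_context a = { .plug = &plug_a }; /* plug started by A */
	struct model_context b = { .plug = NULL };    /* B has no plug */

	model_submit_write(&dev, &a, 0);
	model_submit_write(&dev, &a, 1);
	model_submit_write(&dev, &b, 2);
	model_finish_plug(&dev, &a);

	for (size_t i = 0; i < dev.count; i++)
		order[i] = dev.dispatched[i];
	return dev.count;
}
```

Calling model_run_scenario(0, order) honors context A's plug, so the modeled
device sees the BIOs as 2, 0, 1, which is the reordering shown in the diagram.
Calling model_run_scenario(1, order) bypasses the plug for writes and the
device sees 0, 1, 2. The mutex is left out of the model because the calls are
already made in the serialized order.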