On 11/20/22 23:37, Ming Lei wrote:
> On Sat, Nov 19, 2022 at 09:24:01AM +0900, Damien Le Moal wrote:
>> On 11/18/22 21:47, Ming Lei wrote:
>>> On Fri, Nov 18, 2022 at 12:49:15PM +0100, Andreas Hindborg wrote:
>>>>
>>>> Ming Lei <ming.lei@xxxxxxxxxx> writes:
>>>>
>>>>> On Fri, Nov 18, 2022 at 10:41:31AM +0100, Andreas Hindborg wrote:
>>>>>>
>>>>>> Ming Lei <ming.lei@xxxxxxxxxx> writes:
>>>>>>
>>>>>>> On Fri, Nov 18, 2022 at 01:35:29PM +0900, Damien Le Moal wrote:
>>>>>>>> On 11/18/22 13:12, Ming Lei wrote:
>>>>>>>> [...]
>>>>>>>>>>> You can only assign it to a zoned write request, but you still have to check
>>>>>>>>>>> the sequence inside each zone, right? Then why not just check LBAs in
>>>>>>>>>>> each zone simply?
>>>>>>>>>>
>>>>>>>>>> We would need to know the zone map, which is not otherwise required.
>>>>>>>>>> Then we would need to track the write pointer for each open zone for
>>>>>>>>>> each queue, so that we can stall writes that are not issued at the write
>>>>>>>>>> pointer. This is in effect all zones, because we cannot track when zones
>>>>>>>>>> are implicitly closed. Then, if different queues are issuing writes to
>>>>>>>>>
>>>>>>>>> Can you explain the "implicitly closed" state a bit?
>>>>>>>>>
>>>>>>>>> From https://zonedstorage.io/docs/introduction/zoned-storage, only the
>>>>>>>>> following words are mentioned about the closed state:
>>>>>>>>>
>>>>>>>>> ```Conversely, implicitly or explicitly opened zoned can be transitioned to the
>>>>>>>>> closed state using the CLOSE ZONE command.```
>>>>>>>>
>>>>>>>> When a write is issued to an empty or closed zone, the drive will
>>>>>>>> automatically transition the zone into the implicit open state. This is
>>>>>>>> called implicit open because the host did not (explicitly) issue an open
>>>>>>>> zone command.
>>>>>>>>
>>>>>>>> When there are too many implicitly open zones, the drive may choose to
>>>>>>>> close one of the implicitly opened zones to implicitly open the zone that
>>>>>>>> is the target of a write command.
>>>>>>>>
>>>>>>>> Simple, in a nutshell. This is done so that the drive can work with a
>>>>>>>> limited set of resources needed to handle open zones, that is, zones that
>>>>>>>> are being written. There are some more nasty details to all this with
>>>>>>>> limits on the number of open zones and active zones that a zoned drive may
>>>>>>>> have.
>>>>>>>
>>>>>>> OK, thanks for the clarification about the implicitly closed state, but I
>>>>>>> understand this close can't change the zone's write pointer.
>>>>>>
>>>>>> You are right, it does not matter if the zone is implicitly closed; I
>>>>>> was mistaken. But we still have to track the write pointer of every zone
>>>>>> in the open or active state, otherwise we cannot know if a write that arrives
>>>>>> at a zone with no outstanding IO is actually at the write pointer, or
>>>>>> whether we need to hold it.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Zone info can be cached in a mapping (hash table, with the zone start sector
>>>>>>>>> as key and the zone info as value), which can be implemented in an LRU style.
>>>>>>>>> If a zone's info isn't found in the mapping table, ioctl(BLKREPORTZONE) can be
>>>>>>>>> called to obtain the zone info.
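
For illustration, the fallback lookup Ming describes could be done with a BLKREPORTZONE call along these lines. This is only a sketch; fetch_zone_info() is a hypothetical helper, not code from ublksrv or the kernel:

```
#include <linux/blkzoned.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: fetch the descriptor (start, write pointer,
 * condition) of the zone containing @sector from the backing zoned
 * device. The caller could insert the result into an LRU hash table
 * keyed by the zone start sector, as suggested above.
 */
static int fetch_zone_info(int fd, __u64 sector, struct blk_zone *out)
{
	struct blk_zone_report *rep;
	int ret;

	rep = calloc(1, sizeof(*rep) + sizeof(struct blk_zone));
	if (!rep)
		return -1;

	rep->sector = sector;	/* any sector inside the zone of interest */
	rep->nr_zones = 1;	/* ask for a single zone descriptor */

	ret = ioctl(fd, BLKREPORTZONE, rep);
	if (!ret && rep->nr_zones == 1)
		*out = rep->zones[0];
	else if (!ret)
		ret = -1;	/* no zone reported: sector past the last zone */

	free(rep);
	return ret;
}
```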
>>>>>>>>>
>>>>>>>>>> the same zone, we need to sync across queues. Userspace may have
>>>>>>>>>> synchronization in place to issue writes with multiple threads while
>>>>>>>>>> still hitting the write pointer.
>>>>>>>>>
>>>>>>>>> You can trust mq-deadline, which guarantees that write IO is sent to ->queue_rq()
>>>>>>>>> in order, no matter MQ or SQ.
>>>>>>>>>
>>>>>>>>> Yes, it could be issued from multiple queues for ublksrv, which doesn't sync
>>>>>>>>> among multiple queues.
>>>>>>>>>
>>>>>>>>> But per-zone re-ordering can still solve the issue; we just need one lock
>>>>>>>>> for each zone to cover the MQ re-order.
>>>>>>>>
>>>>>>>> That lock is already there and, using it, mq-deadline will never dispatch
>>>>>>>> more than one write per zone at any time. This is to avoid write
>>>>>>>> reordering. So multi queue or not, for any zone, there is no possibility
>>>>>>>> of having writes reordered.
>>>>>>>
>>>>>>> Oops, I missed the single queue depth point per zone, so ublk won't break
>>>>>>> zoned write at all, and I agree the order of batched IOs is a problem, but
>>>>>>> not hard to solve.
>>>>>>
>>>>>> The current implementation _does_ break zoned write because it reverses
>>>>>> batched writes. But if it is an easy fix, that is cool :)
>>>>>
>>>>> Please look at Damien's comment:
>>>>>
>>>>>>> That lock is already there and, using it, mq-deadline will never dispatch
>>>>>>> more than one write per zone at any time. This is to avoid write
>>>>>>> reordering. So multi queue or not, for any zone, there is no possibility
>>>>>>> of having writes reordered.
>>>>>
>>>>> For zoned writes, mq-deadline is used to limit each zone to at most one
>>>>> inflight write.
>>>>>
>>>>> So can you explain a bit how the current implementation breaks zoned
>>>>> write?
>>>>
>>>> Like Damien wrote in another email, mq-deadline will only impose
>>>> ordering for requests submitted in a batch. The flow we have is the
>>>> following:
>>>>
>>>> - Userspace sends requests to the ublk gendisk.
>>>> - Requests go through the block layer and are _not_ reordered when using
>>>>   mq-deadline. They may be split.
>>>> - Requests hit ublk_drv, and ublk_drv will reverse the order of _all_
>>>>   batched-up requests (including split requests).
>>>
>>> For ublk-zone, the ublk driver needs to be exposed as a zoned device by
>>> eventually calling disk_set_zoned(), which definitely isn't supported yet.
>>> Once ublk-zone is supported, mq-deadline sends at most one write IO for
>>> each zone; see blk_req_can_dispatch_to_zone().
>>>
>>>> - ublk_drv sends requests to ublksrv in _reverse_ order.
>>>> - ublksrv sends requests, _not_ batched up, to the target device.
>>>> - Requests that enter mq-deadline at the same time are reordered in LBA
>>>>   order; that is all good.
>>>> - Requests that enter the kernel in different batches are not reordered
>>>>   in LBA order and end up missing the write pointer. This is bad.
>>>
>>> Again, please read Damien's comment:
>>>
>>>>> That lock is already there and, using it, mq-deadline will never dispatch
>>>>> more than one write per zone at any time.
>>>
>>> At any time there is at most one write IO for each zone, so how can that
>>> single write IO be re-ordered?
>>
>> If the user issues writes one at a time out of order (not aligned to the
>> write pointer), mq-deadline will not help at all. The zone write locking
>> will still limit write dispatching to one per zone, but the writes will fail.
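
The write pointer tracking Andreas mentions (needed to hold back a write that is not aligned to the write pointer, since such a write would fail as Damien notes) could look roughly like the sketch below in a userspace target. All names here are hypothetical; this is not existing ublksrv code:

```
#include <pthread.h>
#include <stdbool.h>
#include <linux/types.h>

/*
 * Hypothetical per-zone state a userspace target could keep in order to
 * hold back writes that are not aligned to the write pointer.
 */
struct zone_state {
	pthread_mutex_t lock;	/* covers cross-queue access to @wp */
	__u64 wp;		/* cached write pointer, in 512-byte sectors */
};

/*
 * True if a write starting at @sector may be issued now; false means the
 * caller should requeue it until earlier writes advance the write pointer.
 */
static bool zone_write_ready(struct zone_state *z, __u64 sector)
{
	bool ready;

	pthread_mutex_lock(&z->lock);
	ready = (sector == z->wp);
	pthread_mutex_unlock(&z->lock);
	return ready;
}

/* Called from the completion path of a successful write. */
static void zone_write_done(struct zone_state *z, __u64 sector, __u32 nr_sectors)
{
	pthread_mutex_lock(&z->lock);
	if (sector == z->wp)
		z->wp += nr_sectors;
	pthread_mutex_unlock(&z->lock);
}
```

The initial wp value would have to come from a zone report, e.g. via the BLKREPORTZONE lookup sketched earlier.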
>>
>> mq-deadline will reorder write commands in the correct LBA order only if:
>> - the commands are inserted as a batch (more than one request passed to
>>   ->insert_requests)
>> - commands are inserted individually when the target zone is locked (a
>>   write is already being executed)
>>
>> This has been the semantics from the start: the block layer makes no
>> guarantees about the correct ordering of writes to a zoned drive. What is
>> guaranteed is that (1) if the user issues writes in order AND (2)
>> mq-deadline is used, then writes will be dispatched in the same order to
>> the device.
>>
>> I have not looked at the details of ublk, but from the thread, I think (1)
>> is not done and (2) is missing-ish as the ublk device is not marked as zoned.
>
> (2) is supposed to be done for supporting ublk-zone, at least in
> Andreas's PR [1], which just implements one ublk-loop over one real backing
> zoned disk.
>
> mq-deadline dispatches one write IO for each zone on /dev/ublkbN
> (supposed to be exposed as zoned, not done yet), and this write IO will
> be forwarded to ublksrv, so ublksrv can only see one such write IO for
> each zone, which will be re-issued to the backing real zoned disk, so
> there can't be a write IO reorder issue, because the write batch just
> has one write IO for each zone.

Good point!

>
>
> [1] https://github.com/ming1/ubdsrv/pull/28/files
>
> Thanks,
> Ming
>
--
Damien Le Moal
Western Digital Research
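
For reference, the "at most one write per zone at any time" behavior relied on above comes from the block layer's per-zone write lock that mq-deadline takes before dispatching a write. The sketch below only illustrates that idea (in the spirit of the kernel's blk_req_zone_write_trylock()); it is not the actual kernel implementation:

```
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Illustration only, not kernel code: one flag per sequential-write zone.
 * A write is dispatched only if the flag for its target zone can be
 * taken, so at most one write per zone is in flight at any time.
 */
struct zoned_disk {
	atomic_flag *zone_wlock;	/* one flag per sequential zone */
	unsigned long long zone_sectors;/* zone size, in 512-byte sectors */
};

static unsigned int zone_no(struct zoned_disk *d, unsigned long long sector)
{
	return sector / d->zone_sectors;
}

/* Returns true if the write targeting @sector may be dispatched now. */
static bool zone_write_trylock(struct zoned_disk *d, unsigned long long sector)
{
	return !atomic_flag_test_and_set(&d->zone_wlock[zone_no(d, sector)]);
}

/* Called when the write completes (or fails), releasing the zone. */
static void zone_write_unlock(struct zoned_disk *d, unsigned long long sector)
{
	atomic_flag_clear(&d->zone_wlock[zone_no(d, sector)]);
}
```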