Josef,

On 2019/06/13 23:15, Josef Bacik wrote:
> On Fri, Jun 07, 2019 at 10:10:17PM +0900, Naohiro Aota wrote:
>> Sequential allocation is not enough to maintain sequential delivery of
>> write IOs to the device. Various features (async compress, async checksum,
>> ...) of btrfs affect ordering of the IOs. This patch introduces submit
>> buffer to sort WRITE bios belonging to a block group and sort them out
>> sequentially in increasing block address to achieve sequential write
>> sequences with __btrfs_map_bio().
>>
>> Signed-off-by: Naohiro Aota <naohiro.aota@xxxxxxx>
>
> I hate everything about this. Can't we just use the plugging infrastructure for
> this and then make sure it re-orders the bios before submitting them? Also
> what's to prevent the block layer scheduler from re-arranging these io's?
> Thanks,

The block I/O scheduler reorders requests in LBA order, but it only does so
for a newly inserted request against requests that are still pending. If
there are no pending requests because all of them were already issued, no
reordering happens. Even worse, if the drive queue is not full yet (e.g.
there are free tags), the newly inserted request is dispatched almost
immediately, which prevents it from being reordered with subsequent incoming
write requests.

The other problem is that the mq-deadline scheduler does not track the zone
write pointer (WP) position. Write requests are issued solely based on LBA
ordering, regardless of the current WP value. This means that mq-deadline
does not prevent out-of-order, or rather, unaligned write requests. These
are not detected, and they are dispatched whenever possible. The reasons for
this are:

1) The disk user (the FS) has to manage zone WP positions anyway, so
   duplicating that management at the block IO scheduler level is
   inefficient.

2) Adding zone WP management at the block IO scheduler level would also
   need a write error processing path to resync the WP value in case of
   failed writes. But the user/FS needs that anyway. Again, duplicated
   functionality.

3) The block layer would need a timeout to force issue or cancel pending
   unaligned write requests. This is necessary in case the drive user stops
   issuing writes (for whatever reason) or the scheduler is being switched.
   This would unnecessarily cause write I/O errors, or cause deadlocks if
   the request queue quiesce mode is entered at the wrong time (and I do
   not see a good way to deal with that).

blk-mq is already complicated enough. Adding this to the block IO scheduler
would complicate things further for no real benefit. I would like to point
out that the dm-zoned device mapper and f2fs both already deal with write
ordering and write error processing directly. Both are fairly
straightforward but completely different, and each is optimized for its own
structure.

Naohiro's changes to the btrfs IO scheduler have the same intent, that is,
to efficiently integrate and handle write ordering "a la btrfs". Would
creating a different "hmzoned" btrfs IO scheduler help address your
concerns?

Best regards.

-- 
Damien Le Moal
Western Digital Research
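
To illustrate the dispatch problem described above, here is a minimal toy model in Python. It is not kernel code and not the patch's implementation: the names `Zone`, `dispatch_immediately` and `dispatch_with_submit_buffer` are hypothetical, and the model assumes a single sequential-write-required zone that rejects any write not starting at its write pointer.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """Toy sequential-write-required zone; wp is the next writable LBA."""
    start: int
    wp: int = field(init=False)

    def __post_init__(self):
        self.wp = self.start

    def write(self, lba, nr_blocks):
        """Accept the write only if it starts exactly at the write pointer."""
        if lba != self.wp:
            return False  # unaligned write: the device would fail it
        self.wp += nr_blocks
        return True


def dispatch_immediately(zone, writes):
    """Model a device queue with free tags: each write is issued on arrival,
    so there is never a pending request to reorder against."""
    return [zone.write(lba, n) for lba, n in writes]


def dispatch_with_submit_buffer(zone, writes):
    """Hold writes that are ahead of the WP and issue them once the gap
    in front of them has been filled (hypothetical submit buffer)."""
    pending = {}  # lba -> nr_blocks, writes waiting for the WP to reach them
    issued = []
    for lba, n in writes:
        pending[lba] = n
        while zone.wp in pending:
            cur = zone.wp
            issued.append((cur, zone.write(cur, pending.pop(cur))))
    return issued, pending


# Writes arrive out of order, e.g. because of async checksum/compression.
arrivals = [(8, 8), (0, 8), (16, 8)]

print(dispatch_immediately(Zone(start=0), arrivals))
print(dispatch_with_submit_buffer(Zone(start=0), arrivals))
```

In this model, issuing on arrival fails the first and third writes, while buffering at the submitter issues all three in WP order. Any write still left in `pending` at the end is one whose predecessor never arrived, which is the case the timeout in point 3) would have to handle if this logic lived in the block layer.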