On 2019/06/18 22:34, Josef Bacik wrote: > On Mon, Jun 17, 2019 at 03:16:05AM +0000, Damien Le Moal wrote: >> The block I/O scheduler reorders requests in LBA order, but that happens for a >> newly inserted request against pending requests. If there are no pending >> requests because all requests were already issued, no ordering happen, and even >> worse, if the drive queue is not full yet (e.g. there are free tags), then the >> newly inserted request will be dispatched almost immediately, preventing >> reordering with subsequent incoming write requests to happen. >> > > This sounds like we're depending on specific behavior from the ioscheduler, > which means we're going to have a sad day at some point in the future. In a sense yes, we are. But my team and I always make sure that such sad day do not come. We are always making sure that HM-zoned drives can be used and work as expected (all RCs and stable versions are tested weekly). For now, getting guarantees on write requests order mandates the use of the mq-deadline scheduler as it is currently the only one providing these guarantees. I just sent a patch to ensure that this scheduler is always available with CONFIG_BLK_DEV_ZONED enabled (see commit b9aef63aca77 "block: force select mq-deadline for zoned block devices") and automatically configuring it for HM zoned devices is simply a matter of adding an udev rule to the system (mq-deadline is the default scheduler for spinning rust anyway). >> The other problem is that the mq-deadline scheduler does not track zone WP >> position. Write request issuing is done regardless of the current WP value, >> solely based on LBA ordering. This means that mq-deadline will not prevent >> out-of-order, or rather, unaligned write requests. These will not be detected >> and dispatched whenever possible. The reasons for this are that: >> 1) the disk user (the FS) has to manage zone WP positions anyway. So duplicating >> that management at the block IO scheduler level is inefficient. > > I'm not saying it has to manage the WP pointer, and in fact I'm not saying the > scheduler has to do anything at all. We just need a more generic way to make > sure that bio's submitted in order are kept in order. So perhaps a hmzoned > scheduler that does just that, and is pinned for these devices. This is exactly what mq-deadline does for HM devices: it guarantees that write bio order submission is kept as is for request dispatching to the disk. The only missing part is "pinned for these devices". This is not possible now. A user can still change the scheduler to say BFQ. But in that case, unaligned write errors will show up very quickly. So this is easy to debug. Not ideal I agree, but that can be fixed independently of BtrFS support for hmzoned disks. >> 2) Adding zone WP management at the block IO scheduler level would also need a >> write error processing path to resync the WP value in case of failed writes. But >> the user/FS also needs that anyway. Again duplicated functionalities. > > Again, no not really. My point is I want as little block layer knowledge in > btrfs as possible. I accept we should probably keep track of the WP, it just > makes it easier on everybody if we allocate sequentially. I'll even allow that > we need to handle the write errors and adjust our WP stuff internally when > things go wrong. > > What I'm having a hard time swallowing is having a io scheduler in btrfs proper. > We just ripped out the old one we had because it broke cgroups. It just adds > extra complexity to an already complex mess. I understand your point. It makes perfect sense. The "IO scheduler" added for hmzoned case is only the method proposed to implement sequential write issuing guarantees. The sequential allocation was relatively easy to achieve, but what is really needed is an atomic "sequential alloc blocks + issue write BIO for these blocks" so that the block IO schedulker sees sequential write streams per zone. If only the sequential allocation is achieved, write bios serving these blocks may be reordered at the FS level and result in write failures since the block layer scheduler only guarantees preserving the order without any reordering guarantees for unaligned writes. >> 3) The block layer will need a timeout to force issue or cancel pending >> unaligned write requests. This is necessary in case the drive user stops issuing >> writes (for whatever reasons) or the scheduler is being switched. This would >> unnecessarily cause write I/O errors or cause deadlocks if the request queue >> quiesce mode is entered at the wrong time (and I do not see a good way to deal >> with that). > > Again we could just pin the hmzoned scheduler to those devices so you can't > switch them. Or make a hmzoned blk plug and pin no scheduler to these devices. That is not enough. Pinning the schedulers or using the plugs cannot guarantee that write requests issued out of order will always be correctly reordered. Even worse, we cannot implement this. For multiple reason as I stated before. One example that may illustrates this more easily is this: imagine a user doing buffered I/Os to an hm disk (e.g. dd if=/dev/zero of=/dev/sdX). The first part of this execution, that is, allocate a free page, copy the user data and add the page to the page cache as dirty, is in fact equivalent to an FS sequential block allocation (the dirty pages are allocated in offset order and added to the page cache in that same order). Most of the time, this will work just fine because the page cache dirty page writeback code is mostly sequential. Dirty pages for an inode are found in offset order, packed into write bios and issued sequentially. But start putting memory pressure on the system, or executing "sync" or other applications in parallel, and you will start seeing unaligned write errors because the page cache atomicity is per page so different contexts may end up grabbing dirty pages in order (as expected) but issuing interleaved write bios out of order. And this type of problem *cannot* be handled in the block layer (plug or scheduler) because stopping execution of a bio expecting that another bio will come is very dangerous as there are no guarantees that such bio will ever be issued. In the case of the page cache flush, this is actually a real eventuality as memory allocation needed for issuing a bio may depend on the completion of already issued bios, and if we cannot dispatch those, then we can deadlock. This is an extreme example. This is unlikely but still a real possibility. Similarly to your position, that is, the FS should not know anything about the block layer, the block layer position is that it cannot rely on a specific behavior from the upper layers. Essentially, all bios are independent and treated as such. For HM devices, we needed sequential write guarantees, but could not break the independence of write requests. So what we did is simply guarantee that the dispatch order is preserved from the issuing order, nothing else. There is no "buffering" possible and no checks regarding the sequentiality of writes. As a result, the sequential write constraint of the disks is directly exposed to the disk user (FS or DM). >> blk-mq is already complicated enough. Adding this to the block IO scheduler will >> unnecessarily complicate things further for no real benefits. I would like to >> point out the dm-zoned device mapper and f2fs which are both already dealing >> with write ordering and write error processing directly. Both are fairly >> straightforward but completely different and each optimized for their own structure. >> > > So we're duplicating this effort in 2 places already and adding a 3rd place > seems like a solid plan? Device-mapper it makes sense, we're sitting squarely > in the block layer so moving around bio's/requests is its very reason for > existing. I'm not sold on the file system needing to take up this behavior. > This needs to be handled in a more generic way so that all file systems can > share the same mechanism. I understand your point. But I am afraid it is not easily possible. The reason is that for an FS, to achieve sequential write streams in zones, one need an atomic (or serialized) execution of "block allocation + wrtite bio issuing". Both combined achieve a sequential write stream that mq-deadline will preserve and everything will work as intended. This is obviously not easily possible in a generic manner for all FSes. In f2fs, this was rather easy to do without changing a lot of code by simply using a mutex to have the 2 operations atomically executed without any noticeable performance impact. A similar method in BtrFS is not possible because of async checksum and async compression which can result in btrfs_map_bio() execution in an order that is different from the extent allocation order. > > I'd even go so far as to say that you could just require using a dm device with > these hmzoned block devices and then handle all of that logic in there if you > didn't feel like doing it generically. We're already talking about esoteric > devices that require special care to use, adding the extra requirement of > needing to go through device-mapper to use it wouldn't be that big of a stretch. HM drives are not so "esoteric" anymore. Entire data centers are starting running on them. And getting BtrFS to work natively on HM drives would be a huge step toward facilitating their use, and remove this "esoteric" label :) Back to your point, using a dm to do the reordering is possible, but requires temporary persistent backup of the out-of-order BIOs due to the reasons pointed out above (dependency of memory allocation failure/success on bio completion). This is basically what dm-zoned does, using conventional zones to store out-of-order writes in conventional zones. Such generic DM is enough to run any file system (ext4 or XFS run perfectly fine on dm-zoned), but come at the cost of needing garbage collection with a huge impact on performance. The simple addition of Naohiro's write bio ordering feature in BtrFS avoids all this and preserves performance. I really understand your desire to reduce complexity. But in the end, this is only a "sorted list" that is well controlled within btrfs itself and avoids dependency on the behavior of other components beside the block IO scheduler. We could envision to make such feature generic, implementing it as a block layer object. But its use would still be needed in btrfs. Since f2fs and dm-zoned do not require it, btrfs would be the sole user though, so for now at least, this generic implementation has I think little value. We can work on trying to isolate this bio reordering code more so that it is easier to remove and use a future generic implementation. Would that help in addressing your concerns ? Thank you for your comments. Best regards. -- Damien Le Moal Western Digital Research