Mike,

On 6/15/18 02:58, Mike Snitzer wrote:
> On Thu, Jun 14 2018 at 1:37pm -0400,
> Luis R. Rodriguez <mcgrof@xxxxxxxxxx> wrote:
>
>> On Thu, Jun 14, 2018 at 08:38:06AM -0400, Mike Snitzer wrote:
>>> On Wed, Jun 13 2018 at 8:11pm -0400,
>>> Luis R. Rodriguez <mcgrof@xxxxxxxxxx> wrote:
>>>
>>>> Setting up zoned disks in a generic form is not so trivial. There
>>>> is also quite a bit of tribal knowledge around these devices which
>>>> is not easy to find.
>>>>
>>>> The currently supplied demo script works, but it is not generic
>>>> enough to be practical for Linux distributions or even for
>>>> developers, who often move from one kernel to another.
>>>>
>>>> This tries to put a bit of this tribal knowledge into an initial
>>>> udev rule for development, with the hope that Linux distributions
>>>> can later deploy it. Three rules are added. One rule is optional for
>>>> now; it should be extended later to be more distribution-friendly,
>>>> and then I think this may be ready for consideration for integration
>>>> in distributions.
>>>>
>>>> 1) scheduler setup
>>>
>>> This is wrong. If zoned devices are so dependent on deadline or
>>> mq-deadline then the kernel should allow them to be hardcoded. I know
>>> Jens removed the API to do so, but the fact that drivers need to rely
>>> on hacks like this udev rule to get a functional device is proof we
>>> need to allow drivers to impose the scheduler used.
>>
>> This is the point of the patch as well. I actually tend to agree with
>> you, and I had tried to draw up a patch to do just that; however, it's
>> *not* possible today and would require some consensus. So from what I
>> can tell we *have* to live with this rule or a form of it, e.g. a file
>> describing which disk serial gets deadline and which one gets
>> mq-deadline.
>>
>> Jens?
>>
>> Anyway, let's assume this is done in the kernel. Which devices would
>> use deadline, and which would use mq-deadline?
>
> The zoned storage driver needs to make that call based on what mode it
> is in. If it is using blk-mq then it selects mq-deadline, otherwise
> deadline.

As Bart pointed out, deadline is an alias of mq-deadline, so using
"deadline" as the scheduler name works in both the legacy and mq cases.
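For reference, a minimal sketch of the kind of rule being discussed,
assuming a host-managed drive handled by the sd driver (the file name and
match expressions below are illustrative, not the exact rule from the
patch):

  # /etc/udev/rules.d/99-zoned-disks.rules (name is illustrative)
  # Match whole sd disks that report themselves as host-managed zoned
  # devices and set the deadline scheduler. "deadline" resolves to
  # mq-deadline when the disk is driven through the blk-mq path.
  ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*[!0-9]", \
    ATTR{queue/zoned}=="host-managed", ATTR{queue/scheduler}="deadline"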
>>>> 2) blacklist f2fs devices
>>>
>>> There should probably be support in dm-zoned for detecting whether a
>>> zoned device was formatted with f2fs (assuming there is a known f2fs
>>> superblock)?
>>
>> Not sure what you mean. Are you suggesting we always set up dm-zoned
>> for all zoned disks and just make an exemption in the dm-zoned code to
>> somehow use the disk directly if a filesystem supports zoned disks
>> natively?
>
> No, I'm saying that a udev rule wouldn't be needed if dm-zoned just
> errored out if asked to consume disks that already have an f2fs
> superblock. And existing filesystems should get conflicting superblock
> awareness "for free" if blkid or whatever is trained to be aware of
> f2fs's superblock.

Well, that is the case already: on startup, dm-zoned will read its own
metadata from sector 0, same as f2fs would do with its superblock. If the
format/magic does not match the expected values, dm-zoned will bail out
and return an error. dm-zoned metadata and f2fs metadata reside in the
same place and overwrite each other, so there is no way to get one working
on top of the other. I do not see any possibility of a problem on startup.
But definitely, the userland format tools can step on each other's toes.
That needs fixing.

>> f2fs does not require dm-zoned. What would be required is a bit more
>> complex, given one could dedicate portions of the disk to f2fs and
>> other portions to another filesystem, which would require dm-zoned.
>>
>> Also, filesystems which *do not* support zoned disks should *not*
>> allow direct setup on them. Today that's all filesystems other than
>> f2fs; in the future that may change. Those are loaded guns we are
>> leaving around for users just waiting to shoot themselves in the foot.
>>
>> So who's going to work on all the above?
>
> It should take care of itself if existing tools are trained to be aware
> of new signatures. E.g. ext4 and xfs are already aware of one another,
> so that you cannot reformat a device with the other unless force is
> given.
>
> The same kind of mutual exclusion needs to happen for zoned devices.

Yes.

> So the zoned device tools, dm-zoned, f2fs, whatever... they need to be
> updated to not step on each other's toes. And other filesystems' tools
> need to be updated to be zoned device aware.

I will update the dm-zoned tools to check for known FS superblocks,
similarly to what mkfs.ext4 and mkfs.xfs do.
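As a sketch of what such a check could look like, using libblkid to probe
for an existing signature before formatting (the function name and error
handling are illustrative assumptions, not actual dm-zoned-tools code):

#include <stdio.h>
#include <blkid/blkid.h>

/*
 * Check for an existing filesystem signature on the target device
 * before formatting it for dm-zoned. Returns 1 if a signature is
 * found, 0 if the device looks unused, and -1 on probe failure.
 */
static int dmz_check_fs_signature(const char *dev_path)
{
	blkid_probe pr;
	const char *type = NULL;
	int ret;

	pr = blkid_new_probe_from_filename(dev_path);
	if (!pr)
		return -1;

	/* Probe superblocks only, and report the filesystem type. */
	blkid_probe_enable_superblocks(pr, 1);
	blkid_probe_set_superblocks_flags(pr, BLKID_SUBLKS_TYPE);

	/* 0: signature found, 1: nothing found, < 0: error/ambiguous. */
	ret = blkid_do_safeprobe(pr);
	if (ret == 0 &&
	    blkid_probe_lookup_value(pr, "TYPE", &type, NULL) == 0)
		fprintf(stderr,
			"%s contains an existing %s signature, refusing "
			"to format (use a force flag to override)\n",
			dev_path, type ? type : "filesystem");

	blkid_free_probe(pr);

	if (ret < 0)
		return -1;
	return ret == 0 ? 1 : 0;
}

This is essentially what mkfs.ext4 and mkfs.xfs do before writing a new
superblock, so the same behavior in the dm-zoned format tool would make
the on-disk formats mutually exclusive by default.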
>>>> 3) run dmsetup for the rest of devices
>>>
>>> Automagically running dmsetup directly from udev to create a dm-zoned
>>> target is very much wrong. It just gets in the way of proper support
>>> that should be added to the appropriate tools that admins use to set
>>> up their zoned devices. For instance, persistent use of a dm-zoned
>>> target should be made reliable with a volume manager...
>>
>> Ah yes, but who's working on that? How long will it take?
>
> No idea. As is (from my vantage point), there is close to zero demand
> for zoned devices. It won't be a priority until enough customers are
> asking for it.

From my point of view (drive vendor), things are different. We do see
increasing interest in these drives. However, most use cases are still
limited to application-based direct disk access with minimal involvement
from the kernel, hence the few "support" requests. There are many reasons
for this, but one is, to some extent, the current lack of extended support
in the kernel. Despite all the recent work done, as Luis experienced,
zoned drives are still far harder to set up than regular disks. A
chicken-and-egg situation...

>> I agree it is odd to expect one to use dmsetup and then use a volume
>> manager on top of it. If we can just add proper support to the volume
>> manager, then that's a reasonable way to go.
>>
>> But *we're not there* yet, and as-is today, what is described in the
>> udev script is the best we can do for a generic setup.
>
> Just because doing things right takes work doesn't mean it makes sense
> to elevate this udev script to be packaged in some upstream project
> like udev or whatever.

Agreed. I will start looking into better solutions now that at least one
user (Luis) has complained. The customer is king.

>>> In general this udev script is unwelcome and makes things way worse
>>> for the long-term success of zoned devices.
>>
>> dm-zoned-tools does not acknowledge a roadmap in any way, and just
>> provides a script, which IMHO is less generic and less
>> distribution-friendly. Having a udev rule in place to demonstrate the
>> current state of affairs is IMHO more scalable and demonstrates the
>> issues better than the script.
>>
>> If we have an agreed-upon long-term strategy, let's document that. But
>> from what I gather we are not even in consensus with regards to the
>> scheduler stuff. If we have consensus on the other stuff, let's
>> document that, as dm-zoned-tools is the only place I think folks could
>> reasonably find to deploy these things.
>
> I'm sure Damien and others will have something to say here.

Yes. The scheduler setup pain is real. Jens made it clear that he prefers
a udev rule. I fully understand his point of view; yet, I think an
automatic switch in the block layer would be far easier and generate far
fewer problems for users, and likely fewer "bug reports" to distribution
vendors (and to myself too).

That said, I also like to see the current dependency of zoned devices on
the deadline scheduler as temporary, until a better solution for ensuring
write ordering is found. After all, requiring deadline as the disk
scheduler does impose other limitations on the user: the lack of I/O
priority support and of cgroup-based fairness are two examples of what
other schedulers provide but is lost when forcing deadline.

The obvious fix is of course to make all disk schedulers zone-device
aware. That is a little heavy-handed, though: probably lots of
duplicated/similar code, and many more test cases to cover. This approach
does not seem sustainable to me. We discussed other possibilities at
LSF/MM (a specialized write queue in the multi-queue path). One could also
think of more invasive changes to the block layer (e.g. adding an optional
"dispatcher" layer to tightly control command ordering?). And probably a
lot more options. But I am not yet sure what an appropriate replacement
for deadline would be.

Eventually, the removal of the legacy I/O path may also be the trigger to
introduce some deeper design changes to blk-mq to more easily accommodate
zoned block devices or other non-standard block devices (open-channel SSDs
for instance).

As you can see from the above, working with these drives all day long does
not make for a clear strategy. Input from others here is more than
welcome. I would be happy to write up all the ideas I have to start a
discussion so that we can come to a consensus and have a plan.

>>> I don't dispute there is an obvious void for how to properly set up
>>> zoned devices, but this script is _not_ what should fill that void.
>>
>> Good to know! Again, consider it as an alternative to the script.
>>
>> I'm happy to adapt the language and supply it only as an example
>> script developers can use, but we can't leave users hanging either.
>> Let's at least come up with a plan which we seem to agree on and
>> document that.
>
> Best to try to get Damien and others more invested in zoned devices to
> help you take up your cause. I think it is worthwhile to develop a
> strategy. But it needs to be done in terms of the norms of the existing
> infrastructure we all make use of today. So the first step is making
> existing tools zoned-device aware (even if only to reject such
> devices).

Rest assured that I am fully invested in improving the existing
infrastructure for zoned block devices. As mentioned above,
application-based use of zoned block devices still prevails today, so I do
tend to work more on that side of things (libzbc, tcmu, sysutils for
instance) rather than on better integration with more advanced tools (such
as LVM) relying on kernel features. I am however seeing rising interest in
file systems and also in dm-zoned, so it is definitely time to step up
work in that area to further simplify using these drives.

Thank you for the feedback.

Best regards.

--
Damien Le Moal,
Western Digital