RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices

Matias Bjørling <Matias.Bjorling@xxxxxxx> · Tue, 15 Mar 2022 12:32:18 +0000

> >Given the above, applications have to be conscious of zones in general and
> work within their boundaries. I don't understand how applications can work
> without having per-zone knowledge. An application would have to know about
> zones and their writeable capacity. To decide where and how data is written,
> an application must manage writing across zones, specific offline zones, and
> (currently) its writeable capacity. I.e., knowledge about zones and holes is
> required for writing to zoned devices and isn't eliminated by removing the PO2
> zone size requirement.
> 
> Supporting offlines zones is optional in the ZNS spec? We are not considering
> supporting this in the host. This will be handled by the device for exactly
> maintaining the SW stack simpler.

It isn't optional. The spec allows any zone to transition to the Read Only or Offline state at any point in time. A specific implementation might give some guarantees as to when such transitions happen, but they must nevertheless be managed by the host software.

Given that, and the need to not issue writes that span zones, an application would have to be aware of such behaviors. The information needed to make those decisions is in a zone's attributes, and since applications would pull those anyway, they would also know the writeable capacity of a zone. So, all in all, adding support for NPO2 takes a lot of work, but might have little to no impact on the overall software design.
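
To make the point above concrete, here is a rough sketch of the bookkeeping an application ends up doing regardless of whether zone sizes are PO2 or not. It is illustrative only: the Zone fields, the state strings, and the plan_writes() helper are names I made up for this example and do not correspond to any kernel or library API.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    # Hypothetical, simplified view of a zone report entry (all units are LBAs).
    start: int      # first LBA of the zone
    size: int       # distance to the start of the next zone
    capacity: int   # writeable LBAs in the zone, capacity <= size
    state: str      # simplified: "ok", "read_only", or "offline"
    wp: int         # write pointer as an absolute LBA


def plan_writes(zones, nr_lbas):
    """Split nr_lbas into (start_lba, length) writes that never span a zone.

    Zones that are read-only, offline, or already full are skipped, and no
    write extends past a zone's writeable capacity into the hole behind it.
    """
    plan = []
    for z in zones:
        if nr_lbas == 0:
            break
        if z.state != "ok":
            continue  # host must cope with Read Only / Offline zones
        room = z.start + z.capacity - z.wp  # LBAs left before the hole
        if room <= 0:
            continue  # zone is full
        n = min(room, nr_lbas)
        plan.append((z.wp, n))
        nr_lbas -= n
    if nr_lbas:
        raise ValueError("not enough writeable space in the given zones")
    return plan


zones = [
    Zone(start=0, size=128, capacity=96, state="ok", wp=90),
    Zone(start=128, size=128, capacity=96, state="offline", wp=128),
    Zone(start=256, size=128, capacity=96, state="ok", wp=256),
]
print(plan_writes(zones, 20))
```

Note that nothing in the loop depends on the zone size being a power of two. The per-zone state and capacity checks are needed either way, which is why I don't expect NPO2 support to change the overall application design much.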

> >
> >For years, the PO2 requirement has been known in the Linux community and
> by the ZNS SSD vendors. Some SSD implementors have chosen not to support
> PO2 zone sizes, which is a perfectly valid decision. But its implementors
> knowingly did that while knowing that the Linux kernel didn't support it.
> >
> >I want to turn the argument around to see it from the kernel developer's point
> of view. They have communicated the PO2 requirement clearly, there's good
> precedence working with PO2 zone sizes, and at last, holes can't be avoided
> and are part of the overall design of zoned storage devices. So why should the
> kernel developer's take on the long-term maintenance burden of NPO2 zone
> sizes?
> 
> You have a good point, and that is the question we need to help answer.
> As I see it, requirements evolve and the kernel changes with it as long as there
> are active upstream users for it.

True. There are also active users of custom SSDs (e.g., ones that require writes larger than 4KiB), but those aren't supported by the Linux kernel and, to my knowledge, aren't actively being worked on. Which is fine, as those customers use them in their own way anyway and don't need Linux kernel support.

> 
> The main constraint for (1) PO2 is removed in the block layer, we have (2) Linux hosts
> stating that unmapped LBAs are a problem, and we have (3) HW supporting
> size=capacity.
> 
> I would be happy to hear what else you would like to see for this to be of use to
> the kernel community.

(Added numbers to your paragraph above)

1. The sysfs chunksize attribute was "misused" to also represent zone size. What has changed is that RAID controllers can now use an NPO2 chunk size. This wasn't meant to naturally extend to zones, which, as the currently posted patchset shows, is a lot more work.
2. Bo mentioned that the software already manages holes. It took a bit of time to get right, but now it works. Thus, the software in question is already capable of working with holes, and fixing this would amount to a minor optimization overall. I'm not convinced the work to do this in the kernel is proportional to the change it'll make to the applications.
3. I'm happy to hear that. However, I'd like to reiterate the point that the PO2 requirement has been known for years. That there's a drive doing NPO2 zones is great, but a decision was made by the SSD implementors to not support the Linux kernel given its current implementation.

All that said - if there are people willing to do the work, and it doesn't have a negative impact on performance, code quality, maintenance complexity, etc., then there isn't anything saying support can't be added - but it does seem like a lot of work for little overall benefit to applications and host users.