Thanks for the comments Damien. See inline.

-----Original Message-----
From: Damien Le Moal <Damien.LeMoal@xxxxxxx>
Sent: Thursday, April 23, 2020 12:20 AM
To: Pierre Labat (plabat) <plabat@xxxxxxxxxx>; fio@xxxxxxxxxxxxxxx
Subject: [EXT] Re: A few change to increase test coverage for zoned devices.

On 2020/04/22 3:05, Pierre Labat (plabat) wrote:
> Hi,
>
> We (Micron San Jose CA) have a few FIO changes to propose. The general
> goal of these changes is to increase the test coverage for zoned
> devices. Below is a summary.
>
> 1. A ZNS namespace has a maximum number of open zones (that is, the
> number of zones one can write in parallel) that may be smallish. ZNS
> has another limitation, the maximum number of "active" zones (active =
> a zone in the open or closed state). This number can be much bigger
> than the maximum number of open zones.
>
> This fio change allows testing the device with respect to this maximum
> number of active zones.
>
> Fio is given a maximum number of active zones. The threads/jobs (their
> number is limited to the maximum number of open zones) write a bit into
> one active zone, close it, jump to another active zone (opening it with
> an implicit write open), and so on. The writing threads keep
> ping-ponging across the active zones, writing a bit each time and then
> closing the zone. An active zone is re-opened when a writing thread
> writes into it again (implicit open). As a consequence, the write load
> is spread across all active zones while never exceeding the maximum
> number of active zones.

When you say "close the zone", do you mean an explicit close? If yes,
that is one more system call (ioctl) and NVMe command to issue for every
single write command. Performance will be horrible.

Pierre> Agreed. This is a way to test explicit close. But beyond that,
no interest. <Pierre

> Then at some point some active zones get full. When that happens, they
> are not "active" anymore. Fio selects other zones automatically; they
> will become active on the first write. As a consequence, over time, the
> active zones move across the namespace (but they stay in the window
> specified by fio). That gives the device a good workout, running it at
> its maximum limit of active zones and jumping (to write in a zone) at a
> high rate from one zone to another.

All of this is the exact description (minus the "close the zone") of
what the current max_open_zones=X option does. It will select X zones
for writing and will keep writing these zones until they are full, at
which point other zones are selected. This means that you will always
get the number of active zones and the number of implicitly open zones
to be equal. If the fio command line specifies max_open_zones=X with
X <= the max active zones for the device, there will be no write IO
rejected by the drive.

Pierre> That works, using the device's ability to automatically close
zones as the maximum number of open zones is exceeded. <Pierre
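Pierre> As an illustration, a minimal job file exercising that
max_open_zones behavior could look like the sketch below. zonemode=zbd
and max_open_zones are existing fio options; the device path, block
size, queue depth, and runtime are placeholders to adapt to the device
under test.

    ; Keep at most 8 zones implicitly open at any time; fio moves on to
    ; other zones as the selected ones fill up.
    [global]
    filename=/dev/nvme0n1
    direct=1
    ioengine=libaio
    iodepth=4
    zonemode=zbd
    max_open_zones=8

    [zbd-writer]
    rw=randwrite
    bs=128k
    time_based
    runtime=60

<Pierre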
> 2. An application can "finish" a zone without writing it in full. For
> example, an app could write only half a zone and then finish it. That
> changes the state of the zone to "full". The app cannot write anymore
> in the zone. The zone will have to be reset at some point.
>
> We have a change in FIO that allows testing that. A new option tells
> fio to stop writing in a zone when reaching some threshold and to
> "finish" it. At that point, fio sends a zone management command to
> finish the zone and considers the zone full (even if it is not actually
> full of app data).

We could indeed add this fairly easily, as that is how the zone reset
rate also works. However, I personally do not see any good use case for
the finish operation. Since "because we can" is not the best
justification for new code, it may be good to put forward a use case. Of
note is that fio is a performance measurement tool, not a drive test
tool, so implementing this for "testing" the finish operation does not
sound to me like a good idea either.

Pierre> About a use case: think of a program filling up a number of
zones, with the last one not perfectly full. That set of zones
corresponds to an item with the same lifespan. As this item gets old, or
is predicted to get [very] old, that last zone would be finished. That
would allow another zone to become active. (A rough sketch of what the
finish operation looks like at the ioctl level is appended at the end of
this mail.) <Pierre

> 3. Another change is relaxing the checks in zbd_verify_sizes() with
> regard to read IO on zoned devices. The reading can start anywhere in a
> zone (below the WP); it doesn't need to always start at the zone
> beginning.

This is indeed true for sequential read workloads, but not for random
reads. Are you referring to sequential reads? That should be easy to
fix, but again, the usefulness of this is not clear to me.

Pierre> About a use case: a program has some metadata pointing to some
offset in the middle of a zone. The program is going to start reading
sequentially from there [using read-ahead]. <Pierre

> I would have to rebase these changes on the latest fio and do some
> cleanup before having patches ready.

ZNS needs zone capacity support too. We have patches ready to go for
that one. Waiting for the kernel-side patches first though; otherwise,
testing is not possible.

Pierre> I see. Good to know. <Pierre

> Let me know if that makes sense to add the above to fio.
>
> Best regards,
>
> Pierre
>

--
Damien Le Moal
Western Digital Research
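Sketch referenced above (the finish operation at the ioctl level). This
is a minimal, untested example, not fio code: it assumes a kernel that
exposes the BLKFINISHZONE ioctl (v5.5 and later) via linux/blkzoned.h,
and it takes the zone start and length in 512-byte sector units.

    /* finish_zone.c - transition one zone to "full" without writing
     * the rest of it. Build: cc -o finish_zone finish_zone.c */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/blkzoned.h>

    int main(int argc, char **argv)
    {
        struct blk_zone_range range;
        int fd;

        if (argc != 4) {
            fprintf(stderr,
                    "usage: %s <dev> <zone start sector> <zone sectors>\n",
                    argv[0]);
            return 1;
        }

        fd = open(argv[1], O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* The range must cover the zone: start and length, both in
         * 512-byte sectors. */
        range.sector = strtoull(argv[2], NULL, 0);
        range.nr_sectors = strtoull(argv[3], NULL, 0);

        /* One ioctl, one NVMe Zone Management Send (Finish Zone). */
        if (ioctl(fd, BLKFINISHZONE, &range) < 0) {
            perror("ioctl(BLKFINISHZONE)");
            return 1;
        }
        return 0;
    }

The proposed fio option would presumably issue the equivalent of this
ioctl from within fio whenever a write reaches the configured threshold
in a zone.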