Re: [EXT] Re: A few changes to increase test coverage for zoned devices.

On 2020/04/28 6:28, Pierre Labat (plabat) wrote:
> 
> Thanks for the comments Damien. See inline.
> 
> -----Original Message-----
> From: Damien Le Moal <Damien.LeMoal@xxxxxxx>
> Sent: Thursday, April 23, 2020 12:20 AM
> To: Pierre Labat (plabat) <plabat@xxxxxxxxxx>; fio@xxxxxxxxxxxxxxx
> Subject: [EXT] Re: A few changes to increase test coverage for zoned devices.
> 
> On 2020/04/22 3:05, Pierre Labat (plabat) wrote:
>> Hi,
>> 
>> We (Micron San Jose CA) have a few FIO changes to propose. The general goal
>> of these changes is to increase the test coverage for zoned devices. Below
>> is a summary.
>> 
>> 1. A ZNS namespace has a maximum number of open zones (i.e. the number of
>> zones one can write in parallel), which may be fairly small. There is
>> another ZNS limit: the maximum number of "active" zones (active = a zone in
>> the open or closed state). This number can be much bigger than the maximum
>> number of open zones.
>> 
>> This Fio change allows testing the device with respect to this maximum
>> number of active zones.
>> 
>> Fio is given a maximum number of active zones. The threads/jobs (their
>> number is limited to the maximum number of open zones) write a bit into one
>> active zone, close it, jump to another active zone (implicitly opening it
>> with the first write), and so on. The writing threads keep ping-ponging
>> across the active zones, writing a bit each time and then closing the zone.
>> An active zone is re-opened (implicit open) when a writing thread writes to
>> it again. As a consequence, the write load is spread across all active
>> zones while never exceeding the maximum number of active zones.
> 
> When you say "close the zone" do you mean an explicit close ? If yes, that is
> one more system call (ioctl) and NVMe command to issue for every single write
> command. Performance will be horrible.
> 
> Pierre> Agreed. This is a way to test explicit close, but beyond that it is
> of no other interest. <Pierre

For testing a device function (like an explicit zone close), something like
nvme-cli and the new libnvme it is built on are far better tools. fio is for
performance testing, not device conformance testing. Granted, it ends up doing
that indirectly by exercising the drive, but that does not mean we should
patch fio to do drive functional tests.
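
For example, with a recent nvme-cli built with the zns plugin, an explicit
close or finish can be exercised directly. A rough sketch (untested here; the
device path and LBAs are placeholders):

    # report zones to pick a zone start LBA
    nvme zns report-zones /dev/nvme0n1

    # explicitly open, then close, the zone starting at LBA 0
    nvme zns open-zone /dev/nvme0n1 --start-lba=0
    nvme zns close-zone /dev/nvme0n1 --start-lba=0

    # or transition the zone straight to "full" without filling it
    nvme zns finish-zone /dev/nvme0n1 --start-lba=0

That exercises the zone state machine directly, without involving fio's IO
path at all.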

> 
>> Then at some point some active zones get full. When that happens they are
>> not "active" anymore. Fio automatically selects other zones, which become
>> active on their first write. As a consequence, over time, the active zones
>> move across the namespace (but they stay in the window specified by fio).
>> That gives the device a good workout, running it at its maximum limit of
>> active zones while jumping at a high rate from one zone to another to
>> write.
> 
> All of this is the exact description (minus the "close the zone") of what the
> current max_open_zones=X option does. It will select X zones for writing and
> keep writing these zones until they are full, at which point other zones are
> selected. This means that the number of active zones and the number of
> implicitly open zones will always be equal. If the fio command line specifies
> max_open_zones=X with X <= the maximum number of active zones for the device,
> no write IO will be rejected by the drive.
> 
> Pierre> That works, using the device's ability to automatically close zones
> when the maximum number of open zones is exceeded. <Pierre
> 
>> 
>> 2. An application can "finish" a zone without writing it in full. For
>> example, an app could write only half a zone and then finish it. That
>> changes the state of the zone to "full": the app cannot write to the zone
>> anymore, and the zone will have to be reset at some point.
>> 
>> We have a change in FIO that allows testing that. A new option tells FIO to
>> stop writing in a zone when it reaches some threshold and to "finish" it. At
>> that point, FIO sends a zone management command to finish the zone and
>> considers it full (even if it is not actually full of app data).
> 
> We could indeed add this fairly easily as that is how zone reset rate also
> works. However, I personally do not see any good use case for the finish
> operation. "because we can" not being the best justification for new code, it
> may be good to put forward a use case. Of note is that fio is a performance
> measurement tool, not a drive test tool, so implementing this for "testing"
> the finish operation does not sound to me like a good idea either.
> 
> Pierre> About a use case: that would be a program filling up a number of
> zones, with the last one not being completely full. That set of zones
> corresponds to an item with the same lifespan. As this item gets old, or is
> predicted to get [very] old, that last zone would be finished. That would
> allow another zone to become active. <Pierre

Well, since offset & io_size get aligned to zones, I cannot think of a case
where fio would leave a zone not full. Not 100% sure about it though. Would need
to check again. But yes, in the context of a very limited number of active zones
for a device, adding the capability to finish zones would be OK I guess.
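
To put the current behavior in concrete terms, a write workload along these
lines (a rough sketch, untested; the device path, sizes and limits are
placeholders) already keeps the number of implicitly open, and thus active,
zones bounded, and the zone_reset_threshold/zone_reset_frequency pair is the
mechanism a finish option could mirror:

    fio --name=zns-writes --filename=/dev/nvme0n1 --direct=1 \
        --ioengine=libaio --iodepth=1 --zonemode=zbd --rw=randwrite --bs=64k \
        --max_open_zones=8 \
        --zone_reset_threshold=0.5 --zone_reset_frequency=0.1 \
        --time_based --runtime=300

A zone finish option could hook into the same place as the reset threshold
handling, issuing a finish instead of a reset once a zone is considered done.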

>> 3. Another change relaxes the checks in zbd_verify_sizes() with regard to
>> read IO on zoned devices. Reading can start anywhere in a zone (below the
>> WP); it doesn't always need to start at the zone beginning.
> 
> This is indeed true for sequential read workloads. But not for random reads.
> Are you referring to sequential read ? That should be easy to fix, but again,
> the usefulness of this is not clear to me.
> 
> Pierre> About a use case: a program has some metadata pointing to some offset
> in the middle of a zone. The program is going to start reading sequentially
> from there [using read-ahead]. <Pierre

OK. Valid I think, but putting this in the context of fio as a performance tool,
does it matter if reads start from a zone boundary as opposed to within a zone ?
Performance should be the same, for a decent device that is. It certainly is the
same for SMR HDDs, and I do not expect measurable differences for ZNS drives either.

The problem with not forcing zone alignment is the interaction with verify and
zone resets for read+write workloads. Everything becomes very difficult,
especially verify handling. While I may be wrong, it looks to me like a lot of
changes for not much gain in terms of performance evaluation capabilities. But
please feel free to prove me wrong by sending patches :)
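
For reference, the kind of job being discussed would look something like this
(a sketch only, untested; the device, offset and sizes are placeholders, and
the offset is meant to land in the middle of a zone, below its write pointer):

    fio --name=midzone-read --filename=/dev/nvme0n1 --direct=1 \
        --ioengine=libaio --iodepth=8 --zonemode=zbd --rw=read --bs=128k \
        --offset=4M --io_size=64M

Today, as far as I recall, zbd_verify_sizes() rounds that offset up to the
next zone boundary (and io_size down), so relaxing the check only for
read-only workloads would at least keep it away from the verify and reset
code paths.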

-- 
Damien Le Moal
Western Digital Research



