Re: [PATCH v2 10/11] block: Add support for the zone capacity concept

Damien Le Moal <dlemoal@xxxxxxxxxx> · Fri, 21 Apr 2023 10:52:15 +0900

On 4/21/23 09:29, Jaegeuk Kim wrote:
> On 04/21, Damien Le Moal wrote:
>> On 4/21/23 08:44, Bart Van Assche wrote:
>>> On 4/20/23 16:37, Damien Le Moal wrote:
>>>> Why would you need to handle the max active zone number restriction in the
>>>> scheduler ? That is the user responsibility. btrfs does it (that is still buggy
>>>> though, working on it).
>>>
>>> Hi Damien,
>>>
>>> If the user (filesystem) restricts the number of active zones, the code 
>>> for restricting the number of active zones will have to be duplicated 
>>> into every filesystem that supports zoned devices. Wouldn't it be better 
>>> if the I/O scheduler tracks the number of active zones?
>>
>> I do not think so. The reason is that for a file system, the block allocator
>> must be aware of any active zone limit of the underlying device to make the best
>> decision possible regarding where to allocate blocks for files and metadata. For
> 
> Well, I'm wondering what kind of decision FS can make when allocating zones?

Not sure what you mean... It is very similar to the regular block device case.
The FS does block allocation based on whatever block placement policy it
wants/needs to implement. With zoned devices, the FS block management objects
are mapped to zones (btrfs: block group == zone, f2fs: section == zone) and the
active zone limit simply adds one more constraint on the selection of block
groups for allocating blocks. This is a resource management issue.
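
To make the "one more constraint" point concrete, here is a toy sketch of block
group selection under an active zone limit. This is not btrfs or f2fs code: the
BlockGroup class, pick_block_group() and the "0 means no limit" convention for
max_active are assumptions made up for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlockGroup:
    """Hypothetical FS block group mapped 1:1 to a sequential zone."""
    zone_id: int
    capacity: int      # usable blocks in the zone
    written: int = 0   # blocks already written (write pointer offset)

    @property
    def free(self) -> int:
        return self.capacity - self.written

    @property
    def active(self) -> bool:
        # Partly written zones hold an active zone resource;
        # empty and full zones do not.
        return 0 < self.written < self.capacity


def pick_block_group(groups: List[BlockGroup], nr_blocks: int,
                     max_active: int) -> Optional[BlockGroup]:
    """Pick a block group for nr_blocks while honoring the active zone limit.

    max_active == 0 is assumed to mean "no limit".
    """
    # First choice: an already active group with enough room.
    for bg in groups:
        if bg.active and bg.free >= nr_blocks:
            return bg
    # Otherwise an empty group, but only if activating it is allowed.
    nr_active = sum(bg.active for bg in groups)
    if max_active == 0 or nr_active < max_active:
        for bg in groups:
            if bg.written == 0 and bg.free >= nr_blocks:
                return bg
    # Nothing usable: the caller has to fill or finish a zone first.
    return None
```

The only zone-specific part is the nr_active check before touching an empty
group. Everything else is ordinary placement policy.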

>> btrfs, we added "active block groups" management for that purpose. And we have
>> tracking of a block group active state so that the block allocator can start
>> using new block groups (inactive ones) when previously used ones become full. We
>> also have a "finish block group" for cases when there is not enough remaining
>> free blocks in a group/zone (this does a finish zone operation to make a
>> non-full zone full, that is, inactive).
>>
>> Even if the block IO scheduler were to track active zones, the FS would still
>> need to do its own tracking (e.g. to be able to finish zones when needed). So I
> 
> Why does FS also need to track the # of open zones redundantly? I have two

Because the FS should not be issuing writes to a zone that cannot be activated
on the device. E.g., starting to write to an empty zone when there are already
N zones being written or partly written, with N >= max active zones, will
result in IO failures. Even if you have active zone tracking in the block IO
scheduler, there is absolutely NOTHING that the scheduler can do to prevent
such errors.
E.g. think of this simple case:
1) Let's take a device with max active zones = N (N != 0)
2) The FS implements a block allocation policy which results in new files being
written to empty block groups, with 1 block group == 1 zone
3) User writes N files of 4KB

After step 3, the device has N active zones. So if the user tries to write a new
file, it will fail: the write goes to an empty zone, and that results in an
error because the device cannot activate that zone. AND the FS cannot write
metadata for these files into a metadata block group.

There is nothing that the IO scheduler can do about all this. The FS has to be
more intelligent and do block allocation also based on the current
active/inactive state of the zones used by block groups.
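
The 3-step example above can be replayed with a toy device model. Again, this is
only a sketch: ZonedDevice, write_files_naively() and the sizes used (16 zones,
1024 blocks per zone, max active zones = 4) are invented for the illustration
and do not model any real drive or driver.

```python
class ZonedDevice:
    """Toy zoned device enforcing a max active zones limit."""

    def __init__(self, nr_zones, zone_blocks, max_active):
        self.wp = [0] * nr_zones       # per-zone write pointer offset, in blocks
        self.zone_blocks = zone_blocks
        self.max_active = max_active   # assumed: 0 means "no limit"

    def nr_active(self):
        # A zone is active when it is partly written.
        return sum(0 < wp < self.zone_blocks for wp in self.wp)

    def write(self, zone, nr_blocks):
        """Sequential write at the zone write pointer; raise OSError on failure."""
        if self.wp[zone] + nr_blocks > self.zone_blocks:
            raise OSError("write beyond zone capacity")
        activating = self.wp[zone] == 0
        if activating and self.max_active and self.nr_active() >= self.max_active:
            raise OSError("too many active zones")
        self.wp[zone] += nr_blocks


def write_files_naively(dev, nr_files, file_blocks=1):
    """Policy from step 2: every new file goes to an empty zone."""
    for _ in range(nr_files):
        zone = dev.wp.index(0)         # first empty zone
        dev.write(zone, file_blocks)


dev = ZonedDevice(nr_zones=16, zone_blocks=1024, max_active=4)
write_files_naively(dev, 4)            # N files of one 4KB block each: OK
try:
    write_files_naively(dev, 1)        # file N+1 needs one more active zone
    failed = False
except OSError:
    failed = True
```

A scheduler sitting between write_files_naively() and dev.write() can only
delay or fail that last write. It cannot redirect it to an already active zone,
because the target zone was chosen by the FS allocator.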

> concerns if FS needs to force it:
> 1) performance - waiting for finish_zone before allocating a new zone can break
> the IO pipeline and block FS operations.

The need to perform a zone finish is purely an optimization if, for instance,
you want to reduce fragmentation. E.g., if there is only 4KB of free space left
in a zone and you need to write a 1MB extent, you can write the last 4KB of
that zone, making it full and thus inactive, and write the remaining 1MB - 4KB
in another zone. You are guaranteed that this other zone can be written since
you just deactivated one zone.

But if you do not want to fragment that 1MB extent, then you must finish that
zone with only 4KB left first, to ensure that another zone can be activated.

This of course also depends on the current number of active zones N. If N < max
active zone limit, then there is no need to finish a zone.
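
The choice between splitting the extent and finishing the zone can be written
down as a small decision function. This is a sketch only: plan_extent_write()
and the step names are hypothetical, and it assumes max_active > 0 and
zone_free > 0.

```python
KB = 1024
MB = 1024 * KB


def plan_extent_write(extent, zone_free, nr_active, max_active, allow_split):
    """Return the list of (step, bytes) needed to write `extent` bytes.

    zone_free: bytes left in the currently active zone (assumed > 0).
    nr_active, max_active: current and maximum number of active zones.
    """
    if extent <= zone_free:
        return [("write_current", extent)]
    if allow_split:
        # Filling the current zone makes it full, hence inactive, which
        # frees one active zone resource for the remainder of the extent.
        return [("write_current", zone_free), ("write_new", extent - zone_free)]
    if nr_active < max_active:
        # Still room to activate another zone: no finish needed.
        return [("write_new", extent)]
    # At the limit and fragmentation is not wanted: finish the zone first.
    return [("finish_current", 0), ("write_new", extent)]
```

For the 1MB extent with 4KB left: with allow_split=True the plan is to write
4KB and then 1MB - 4KB in a new zone. With allow_split=False at the limit, the
plan is a zone finish followed by a 1MB write in a new zone.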

> 2) multiple partition support - if F2FS uses two partitions, one on conventional
> partition while the other on zoned partition, we have to maintain such tracking
> mechanism on zoned partition only which gives some code complexity.

Conventional zones have no concept of active zones. All active zone resources
can be allocated to the sequential zones. And zoned block devices do not support
partitions anyway.

> In general, doesn't it make sense that FS (not zoned-device FS) just needs to
> guarantee sequential writes per zone, while IO scheduler needs to limit the #
> of open zones gracefully?

No. That will never work. See the example above: you can end up in a situation
where the drive becomes read-only (all writes failing) if the FS does not direct
block allocation & writes to zones that are already active. No amount of
trickery in the IO scheduler can change that fact.

If you want to hide the active zone limit from the FS, then what is needed is a
device mapper that remaps blocks. That is a lot more overhead than implementing
that support in the FS, and the FS can do a much better job at optimizing block
placement.