[PATCH 00/13] btrfs: zoned: fix active zone tracking issues

Naohiro Aota <naohiro.aota@xxxxxxx> · Sat, 9 Jul 2022 08:18:37 +0900

This series mainly addresses two issues in zoned btrfs's active zone
tracking, plus one issue whose fix is a dependency of the main fixes.

* ChangeLog
- v2
  - Support sanity tests (Johannes)
    - fs_info can be NULL while the sanity tests are running. Consider
      that case under CONFIG_BTRFS_FS_RUN_SANITY_TESTS.
  - Propagate an error of btrfs_zone_finish() (Johannes)
  - Add a comment to max_segments limitation (Christoph)
  - Rename btrfs_finish_one_bg() to btrfs_zone_finish_one_bg() to make it
    clear that it is related to the zoned code.
  - Do not reduce active_total_bytes when finishing a block group.
    - Although the block group is no longer active, it can still have
      "used" bytes. So, it should still be counted to host "total_bytes".
      Otherwise, it breaks the free space calculation.
  - Do not try to activate a fully allocated block group.

* Background

A ZNS drive has an upper limit on the number of zones that can be written
simultaneously. We call the limit max_active_zones. An active zone is
deactivated when we write the zone fully, or when we explicitly send a
REQ_OP_ZONE_FINISH command to make it full.

The zoned btrfs must be aware of max_active_zones to use a ZNS drive. So,
we have an active zone tracking system that considers a block group as
active iff the underlying zone is active. In fact, we consider a block
group (and its underlying zones) as active when we start allocating from
it. Then, when the last region which can be allocated in the block group is
written, we send a REQ_OP_ZONE_FINISH command to each zone and consider the
block group as inactive.

So, in short, we currently depend on writing fully to a zone to finish a block group.
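
As a rough illustration of the tracking described above, here is a minimal
user-space sketch. It is not the btrfs implementation: bg_model,
tracker_model and the model_*() helpers are hypothetical names, and one
block group is assumed to consume exactly one active zone slot.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model; none of these are btrfs identifiers. */
struct bg_model {
	uint64_t capacity;	/* bytes that can be allocated in the block group */
	uint64_t alloc_ptr;	/* bytes handed out by the allocator so far */
	uint64_t written;	/* bytes whose writes have completed */
	bool active;
};

struct tracker_model {
	unsigned int max_active_zones;	/* device limit */
	unsigned int nr_active;		/* currently active block groups */
};

/* Activate the block group if the device limit still allows it. */
static bool model_activate(struct tracker_model *t, struct bg_model *bg)
{
	if (bg->active)
		return true;
	if (t->nr_active >= t->max_active_zones)
		return false;
	bg->active = true;
	t->nr_active++;
	return true;
}

/* Allocate bytes; the block group is activated when allocation starts. */
static bool model_alloc(struct tracker_model *t, struct bg_model *bg,
			uint64_t bytes)
{
	if (bytes > bg->capacity - bg->alloc_ptr)
		return false;
	if (!model_activate(t, bg))
		return false;
	bg->alloc_ptr += bytes;
	return true;
}

/* Complete a write; a fully written block group becomes inactive. */
static void model_write_done(struct tracker_model *t, struct bg_model *bg,
			     uint64_t bytes)
{
	bg->written += bytes;
	if (bg->active && bg->written == bg->capacity) {
		/* Stands in for sending REQ_OP_ZONE_FINISH to each zone. */
		bg->active = false;
		t->nr_active--;
	}
}
```

In this model, a block group that is allocated but never fully written
keeps its active slot forever, which is the dependency the issues below
are about.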

* Issues
** Issue A

In certain situations, the current zoned btrfs's extent allocation fails
with an early -ENOSPC on a ZNS drive. When no block group has enough space
left for the allocation, it tries to allocate a new block group, which is
only possible if we can activate a new zone. If not, it returns -ENOSPC
even though the device still has free space left.

** Issue B

When doing a buffered write, we call cow_file_range() to allocate the data
extent. cow_file_range() works in an all-or-nothing manner: it returns 0 if
it can allocate the whole range, or -ENOSPC if not. Thus, when all the
block groups have little free space left, and btrfs cannot finish any
block group, the allocation partly succeeds but fails in the end. This also
results in an early -ENOSPC.

We cannot finish any block group in a certain situation. Let's consider
that we have 8 active data block groups (forget about metadata/system block
groups here) and each of them has 1 MB of free space left. Now, we want to
do a 10 MB buffered write. We can allocate blocks for 8 of the 10 MB. Then,
we can no longer allocate from any block group. Furthermore, we cannot
finish any block group, because every block group now has 1 MB of reserved
but unwritten space. And, since these 1 MB regions are owned by the
allocating process itself, simply waiting for the regions to be written
won't work.
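
The arithmetic of this example can be checked with a small user-space
sketch. The names model_allocated() and model_cow_result() are
hypothetical; the greedy per-block-group allocation is a simplification
and not how the btrfs allocator actually walks block groups.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_NR_BG	8
#define MODEL_MB	(1024ULL * 1024ULL)

/*
 * Hypothetical model: avail[i] is the unallocated space of active block
 * group i. Take space greedily from each block group and return how many
 * bytes of the "want" request could be allocated.
 */
static uint64_t model_allocated(uint64_t avail[MODEL_NR_BG], uint64_t want)
{
	uint64_t got = 0;

	for (int i = 0; i < MODEL_NR_BG && got < want; i++) {
		uint64_t left = want - got;
		uint64_t take = avail[i] < left ? avail[i] : left;

		avail[i] -= take;
		got += take;
	}
	return got;
}

/* All-or-nothing result, as described for cow_file_range() above. */
static int model_cow_result(uint64_t got, uint64_t want)
{
	return got == want ? 0 : -ENOSPC;
}
```

With 8 block groups of 1 MB each and a 10 MB request, model_allocated()
returns 8 MB and model_cow_result() yields -ENOSPC, while every block
group is left with an allocated but unwritten 1 MB region.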

** Issue C

To address issue A, we needed to disable metadata reservation
over-commit. That reveals that we under-estimate the number of extents to
be written on zoned btrfs. On zoned btrfs, we use a ZONE APPEND command to
write data, whose bio size is limited by max_zone_append_sectors and
max_segments. So, a data extent is always split into pieces no larger than
the limit. As a result, if BTRFS_MAX_EXTENT_SIZE is larger than the limit,
we tend to have more extents than expected from the estimation using
BTRFS_MAX_EXTENT_SIZE.
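
To make the under-estimation concrete, here is a small sketch of the
round-up division behind the estimate. model_count_extents() is a
hypothetical stand-in, not the kernel's count_max_extents(), and the
1 MiB limit used in the note below is an example value only.

```c
#include <stdint.h>

/* BTRFS_MAX_EXTENT_SIZE is 128 MiB in the kernel source. */
#define MODEL_MAX_EXTENT_SIZE	(128ULL * 1024 * 1024)

/* Round-up division: extents of at most "limit" bytes covering "size". */
static uint64_t model_count_extents(uint64_t size, uint64_t limit)
{
	return (size + limit - 1) / limit;
}
```

For a 128 MiB range, the estimate based on MODEL_MAX_EXTENT_SIZE is 1
extent, but with a (hypothetical) 1 MiB bio size limit the range is
written as 128 extents.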

The metadata reservation is done before allocation (e.g., at
btrfs_buffered_write) and released afterward along with the delalloc
process or ordered extent creation. As a result, we can run short of the
metadata reservation in certain situations, which can trigger a WARN.

* Solutions
** For issue A

Issue A is that we can have early -ENOSPC if we cannot activate another
block group and no block group has enough space left.

To avoid the early -ENOSPC, we need to choose one block group and finish it
to make room for a new block group to be activated. But, that is only
possible from the data extent allocation context. From the metadata
context, we can cause a deadlock because we might need to wait for a
running transaction to make the finishing block group read-only.

So, we use two different methods for data allocation and metadata
allocation. For data allocation, we can finish a block group on-demand from
btrfs_reserve_extent() context. The block group to finish will be the one
with the least free space left.
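
The selection rule for data allocation can be sketched as follows. This
is only an illustration: bg_candidate and model_pick_bg_to_finish() are
hypothetical names, and the real code has to deal with locking and block
group state that is ignored here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a block group. */
struct bg_candidate {
	uint64_t capacity;	/* allocatable bytes in the block group */
	uint64_t alloc_offset;	/* bytes already allocated */
	bool active;
};

/*
 * Return the index of the active block group with the least free space
 * left, or -1 if there is no active block group.
 */
static int model_pick_bg_to_finish(const struct bg_candidate *bgs, size_t n)
{
	int best = -1;
	uint64_t best_free = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t avail;

		if (!bgs[i].active)
			continue;
		avail = bgs[i].capacity - bgs[i].alloc_offset;
		if (best == -1 || avail < best_free) {
			best_free = avail;
			best = (int)i;
		}
	}
	return best;
}
```

Picking the block group with the least free space wastes the fewest
bytes when the zone is finished early.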

For metadata allocation, we use flush_space() to ensure that reserved bytes
can be written into active block groups. To do so, we track active block
groups' total bytes as active_total_bytes, and activate a block group
on-demand from flush_space().

Also, a newly allocated block group from some contexts must be activated.
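
The idea for metadata can be sketched as an accounting check. This is a
simplified assumption about how active_total_bytes is used, not the code
in this series: space_model, its fields, and both helpers are
hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified space accounting for one space_info. */
struct space_model {
	uint64_t total_bytes;		/* bytes of all block groups */
	uint64_t active_total_bytes;	/* bytes of active block groups only */
	uint64_t bytes_in_use;		/* used plus already reserved bytes */
};

/* A reservation fits only if active block groups can hold it. */
static bool model_can_reserve(const struct space_model *s, uint64_t bytes)
{
	return s->bytes_in_use + bytes <= s->active_total_bytes;
}

/* Activating an inactive block group of "length" bytes raises the ceiling. */
static void model_activate_bg(struct space_model *s, uint64_t length)
{
	s->active_total_bytes += length;
}
```

When model_can_reserve() fails although total_bytes would allow the
reservation, the flushing path is where a block group gets activated
before the reservation is retried.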

** For issue B

Issue B occurs when we cannot allocate space from any block group and we
cannot finish any block group either. This issue only occurs when
allocating a data extent, because the metadata reservation is ensured to be
contained in active block groups by the solution for issue A.

In this case, writing out the partially allocated region will close the gap
between the allocation pointer and the capacity of the block group, finish
the zone, and open up room to activate a new block group. So, this series
implements the partial write-out and the retrying of the allocation.

In a certain case, we can't allocate anything from the block groups. In
that case, we expect that there are ongoing I/Os which will finish a block
group. So, we wait for them and retry the allocation.
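
The write-out-and-retry loop can be sketched with a toy model. It is
heavily simplified and hypothetical (pool_model and the model_*() helpers
are not btrfs names): all active block groups are folded into one
counter, and writing out the pending data is assumed to fill and finish
them. Waiting for I/O issued by other writers is not modeled.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pool: all active block groups folded into one counter. */
struct pool_model {
	uint64_t active_free;	/* unallocated bytes in active block groups */
	uint64_t pending;	/* allocated but not yet written bytes */
	uint64_t bg_size;	/* size of a fresh block group */
	unsigned int spare_bgs;	/* block groups that could still be activated */
};

/* Write out pending data; true if a new block group got activated. */
static bool model_write_out(struct pool_model *p)
{
	if (p->pending == 0)
		return false;
	p->pending = 0;
	/* Assumption: the write-out filled the zones, so they are finished. */
	if (p->active_free == 0 && p->spare_bgs > 0) {
		p->spare_bgs--;
		p->active_free += p->bg_size;
		return true;
	}
	return false;
}

/* Allocate "want" bytes, writing out the partial result when stuck. */
static int model_alloc_with_retry(struct pool_model *p, uint64_t want)
{
	while (want > 0) {
		uint64_t take = p->active_free < want ? p->active_free : want;

		p->active_free -= take;
		p->pending += take;
		want -= take;
		if (want == 0)
			break;
		if (!model_write_out(p))
			return -ENOSPC;
	}
	return 0;
}
```

In this model the 10-unit request from the issue B example succeeds as
long as one more block group can be activated after the first 8 units
are written out.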

** For issue C

Issue C is that we underestimate the number of extents to be written on
zoned btrfs, because we don't expect an ordered extent to be split by the
size of a bio.

We need to use a proper extent size limit to fix issue C. For that, we
revive the fs_info->max_zone_append_size and use it to calculate
count_max_extents(). Technically, the bio size is also limited by the
max_segments, so the limit is also capped by it.
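
A sketch of how such a capped limit could be computed is shown below.
model_max_extent_size() is a hypothetical helper, the one-page-per-segment
bound is an assumption made for this illustration, and 4 KiB pages are
assumed.

```c
#include <stdint.h>

#define MODEL_SECTOR_SHIFT	9	/* 512-byte sectors */
#define MODEL_PAGE_SHIFT	12	/* assumes 4 KiB pages */

/* Extent size limit: the smaller of the zone append and segment bounds. */
static uint64_t model_max_extent_size(uint32_t max_zone_append_sectors,
				      uint16_t max_segments)
{
	uint64_t by_append =
		(uint64_t)max_zone_append_sectors << MODEL_SECTOR_SHIFT;
	/* Assumption: each segment carries at most one page. */
	uint64_t by_segments = (uint64_t)max_segments << MODEL_PAGE_SHIFT;

	return by_append < by_segments ? by_append : by_segments;
}
```

For example, with 2048 zone append sectors (1 MiB) and 128 segments
(512 KiB under the assumption above), the limit is 512 KiB.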

* Patch structure

The fix for issue C comes first because it is a dependency of the fixes for
issues A and B.

Patches 1 to 5 address issue C by reviving fs_info->max_zone_append_size
and using it to replace BTRFS_MAX_EXTENT_SIZE on zoned btrfs.

Patches 6 to 11 address issue A. In detail, patch 7 fixes the data
allocation by finishing a block group when we cannot activate another block
group. Patch 10 fixes the metadata allocation by finishing a block group at
space reservation time.

Patches 12 and 13 address issue B by writing out the successfully allocated
part first and retrying the allocation for the rest.

Naohiro Aota (13):
  block: add bdev_max_segments() helper
  btrfs: zoned: revive max_zone_append_bytes
  btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
  btrfs: convert count_max_extents() to use fs_info->max_extent_size
  btrfs: use fs_info->max_extent_size in get_extent_max_capacity()
  btrfs: let can_allocate_chunk return int
  btrfs: zoned: finish least available block group on data BG allocation
  btrfs: zoned: introduce space_info->active_total_bytes
  btrfs: zoned: disable metadata overcommit for zoned
  btrfs: zoned: activate metadata BG on flush_space
  btrfs: zoned: activate necessary block group
  btrfs: zoned: write out partially allocated region
  btrfs: zoned: wait until zone is finished when allocation didn't
    progress

 fs/btrfs/block-group.c    |  28 ++++++++-
 fs/btrfs/ctree.h          |  30 ++++++---
 fs/btrfs/delalloc-space.c |   6 +-
 fs/btrfs/disk-io.c        |   3 +
 fs/btrfs/extent-tree.c    |  70 ++++++++++++++++-----
 fs/btrfs/extent_io.c      |   8 ++-
 fs/btrfs/inode.c          |  90 +++++++++++++++++++--------
 fs/btrfs/ioctl.c          |  11 ++--
 fs/btrfs/space-info.c     |  76 ++++++++++++++++++++---
 fs/btrfs/space-info.h     |   4 +-
 fs/btrfs/zoned.c          | 124 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h          |  18 ++++++
 include/linux/blkdev.h    |   5 ++
 13 files changed, 404 insertions(+), 69 deletions(-)

-- 
2.35.1