This series adds zoned block device support to btrfs.

* Summary of changes from v2

The most significant change from v2 is the serialization of sequential
block allocation and submit_bio using a per block group mutex instead of
waiting and sorting BIOs in a buffer. This per block group mutex is now
locked before allocation and released after submission of all BIOs
finishes. The same method is used for both data and metadata IOs.

By using a mutex instead of a submit buffer, we must disable
EXTENT_PREALLOC entirely in HMZONED mode to prevent deadlocks. As a
result, INODE_MAP_CACHE and MIXED_BG are disabled in HMZONED mode, and
the relocation inode is reworked to use btrfs_wait_ordered_range() after
each relocation instead of relying on a preallocated file region.

Furthermore, asynchronous checksum is disabled, in line with the
serialized block allocation and BIO submission. This allows preserving
sequential write IO order without introducing any new functionality such
as submit buffers. Async submit will be removed once we merge the cgroup
writeback support patch series.

* Patch series description

A zoned block device consists of a number of zones. Zones are either
conventional, accepting random writes, or sequential, requiring that
writes be issued in LBA order from each zone write pointer position.
This patch series ensures that the sequential write constraint of
sequential zones is respected while fundamentally not changing btrfs
block and I/O management for blocks stored in conventional zones.

To achieve this, the default chunk size of btrfs is changed on zoned
block devices so that chunks are always aligned to a zone. Allocation of
blocks within a chunk is changed so that the allocation is always
sequential from the beginning of the chunk. To do so, an allocation
pointer is added to block groups and used as the allocation hint.
The allocation changes also ensure that blocks freed below the
allocation pointer are ignored, resulting in sequential block allocation
regardless of the chunk usage.

While the introduction of the allocation pointer ensures that blocks
will be allocated sequentially, I/Os to write out newly allocated blocks
may be issued out of order, causing errors when writing to sequential
zones. To preserve the ordering, this patch series adds mutexes around
allocation and submit_bio and serializes them. Also, this series
disables async checksum and submit to avoid mixing the BIOs.

The zone of a chunk is reset to allow reuse of the zone only when the
block group is being freed, that is, when all the chunks of the block
group are unused.

For btrfs volumes composed of multiple zoned disks, a restriction is
added to ensure that all disks have the same zone size. This restriction
matches the existing constraint that all chunks in a block group must
have the same size.

* Patch series organization

Patch 1 introduces the HMZONED incompatible feature flag to indicate
that the btrfs volume was formatted for use on zoned block devices.

Patches 2 and 3 implement functions to gather information on the zones
of the device (zone types and write pointer positions).

Patches 4 to 8 disable features which are not compatible with the
sequential write constraints of zoned block devices. These include
RAID5/6, space_cache, NODATACOW, TREE_LOG, and fallocate.

Patches 9 and 10 tweak the extent buffer allocation for HMZONED mode to
implement sequential block allocation in block groups and chunks.

Patches 11 and 12 handle the case when the write pointers of the devices
which compose, e.g., a RAID1 block group, do not match.

Patch 13 implements a zone reset for unused block groups.

Patch 14 restricts the possible locations of super blocks to
conventional zones to preserve the existing update in-place mechanism
for the super blocks.

Patches 15 to 21 implement the serialization of allocation and
submit_bio for several types of IO (non-compressed data, compressed
data, direct IO, and metadata). These include re-dirtying once-freed
metadata blocks to prevent write holes.

Patches 22 and 23 disable features which are not compatible with the
serialization to prevent deadlocks. These include MIXED_BG and
INODE_MAP_CACHE.

Patches 24 to 26 tweak some btrfs features to work with HMZONED mode.
These include device-replace, relocation, and repairing IO errors.

Finally, patch 27 adds the HMZONED feature to the list of supported
features.

* Patch testing note

This series is based on kdave/for-5.3-rc2.

Also, you need to cherry-pick the following commits to disable write
plugging with that branch. As described in commit b49773e7bcf3 ("block:
Disable write plugging for zoned block devices"), without these commits,
write plugging can reorder BIOs submitted from multiple contexts, e.g.,
multiple extent_write_cached_pages().

0c8cf8c2a553 ("block: initialize the write priority in blk_rq_bio_prep")
f924cddebc90 ("block: remove blk_init_request_from_bio")
14ccb66b3f58 ("block: remove the bi_phys_segments field in struct bio")
c05f42206f4d ("blk-mq: remove blk_mq_put_ctx()")
970d168de636 ("blk-mq: simplify blk_mq_make_request()")
b49773e7bcf3 ("block: Disable write plugging for zoned block devices")

Furthermore, you need to apply the following patch if you run xfstests
with tcmu-loop disks. Without this patch, xfstests btrfs/003 fails to
"_devmgt_add" after "_devmgt_remove".

https://marc.info/?l=linux-scsi&m=156498625421698&w=2

You can use tcmu-runner [1] to create an emulated zoned device backed by
a regular file.
Here is a setup how-to:
http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation

[1] https://github.com/open-iscsi/tcmu-runner

v2 https://lore.kernel.org/linux-btrfs/20190607131025.31996-1-naohiro.aota@xxxxxxx/
v1 https://lore.kernel.org/linux-btrfs/20180809180450.5091-1-naota@xxxxxxxxx/

Changelog
v3:
 - Serialize allocation and submit_bio instead of bio buffering in
   btrfs_map_bio().
 -- Disable async checksum/submit in HMZONED mode
 - Introduce helper functions and hmzoned.c/h (Josef, David)
 - Add support for repairing IO failure
 - Add support for NOCOW direct IO write (Josef)
 - Disable preallocation entirely
 -- Disable INODE_MAP_CACHE
 -- Relocation is reworked not to rely on preallocation in HMZONED mode
 - Disable NODATACOW
 - Disable MIXED_BG
 - Device extents that cover the super block position are banned (David)
v2:
 - Add support for dev-replace
 -- To support dev-replace, moved submit_buffer one layer up. It now
    handles bio instead of btrfs_bio.
 -- Mark unmirrored Block Group readonly only when there are writable
    mirrored BGs. Necessary to handle degraded RAID.
 - Expire worker uses vanilla delayed_work instead of btrfs's async-thread
 - Device extent allocator now ensures that a region is on the same zone
   type.
 - Add delayed allocation shrinking.
 - Rename btrfs_drop_dev_zonetypes() to btrfs_destroy_dev_zonetypes
 - Fix
 -- Use SECTOR_SHIFT (Nikolay)
 -- Use btrfs_err (Nikolay)

Naohiro Aota (27):
  btrfs: introduce HMZONED feature flag
  btrfs: Get zone information of zoned block devices
  btrfs: Check and enable HMZONED mode
  btrfs: disallow RAID5/6 in HMZONED mode
  btrfs: disallow space_cache in HMZONED mode
  btrfs: disallow NODATACOW in HMZONED mode
  btrfs: disable tree-log in HMZONED mode
  btrfs: disable fallocate in HMZONED mode
  btrfs: align device extent allocation to zone boundary
  btrfs: do sequential extent allocation in HMZONED mode
  btrfs: make unmirroed BGs readonly only if we have at least one
    writable BG
  btrfs: ensure metadata space available on/after degraded mount in
    HMZONED
  btrfs: reset zones of unused block groups
  btrfs: limit super block locations in HMZONED mode
  btrfs: redirty released extent buffers in sequential BGs
  btrfs: serialize data allocation and submit IOs
  btrfs: implement atomic compressed IO submission
  btrfs: support direct write IO in HMZONED
  btrfs: serialize meta IOs on HMZONED mode
  btrfs: wait existing extents before truncating
  btrfs: avoid async checksum/submit on HMZONED mode
  btrfs: disallow mixed-bg in HMZONED mode
  btrfs: disallow inode_cache in HMZONED mode
  btrfs: support dev-replace in HMZONED mode
  btrfs: enable relocation in HMZONED mode
  btrfs: relocate block group to repair IO failure in HMZONED
  btrfs: enable to mount HMZONED incompat flag

 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/compression.c      |   5 +-
 fs/btrfs/ctree.h            |  37 +-
 fs/btrfs/dev-replace.c      | 155 +++++++
 fs/btrfs/dev-replace.h      |   3 +
 fs/btrfs/disk-io.c          |  29 ++
 fs/btrfs/extent-tree.c      | 277 +++++++++++--
 fs/btrfs/extent_io.c        |  22 +-
 fs/btrfs/extent_io.h        |   2 +
 fs/btrfs/file.c             |   4 +
 fs/btrfs/free-space-cache.c |  35 ++
 fs/btrfs/free-space-cache.h |   5 +
 fs/btrfs/hmzoned.c          | 785 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h          | 198 +++++++++
 fs/btrfs/inode.c            |  88 +++-
 fs/btrfs/ioctl.c            |   3 +
 fs/btrfs/relocation.c       |  39 +-
 fs/btrfs/scrub.c            |  89 +++-
 fs/btrfs/space-info.c       |  13 +-
 fs/btrfs/space-info.h       |   4 +-
 fs/btrfs/super.c            |   7 +
 fs/btrfs/sysfs.c            |   4 +
 fs/btrfs/transaction.c      |  10 +
 fs/btrfs/transaction.h      |   3 +
 fs/btrfs/volumes.c          | 207 +++++++++-
 fs/btrfs/volumes.h          |   5 +
 include/uapi/linux/btrfs.h  |   1 +
 27 files changed, 1980 insertions(+), 52 deletions(-)
 create mode 100644 fs/btrfs/hmzoned.c
 create mode 100644 fs/btrfs/hmzoned.h

-- 
2.22.0
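
For readers who want a concrete picture of the scheme in the patch series
description, here is a minimal userspace sketch of an allocation pointer
combined with a per block group mutex held across allocation and
submission. It is not code from this series: struct zoned_bg, the
zoned_bg_*() helpers, and the submit callback are hypothetical names, a
pthread mutex stands in for the kernel mutex, and one block group is
assumed to map to exactly one sequential zone.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one block group backed by one sequential zone. */
struct zoned_bg {
	pthread_mutex_t lock;	/* held from allocation until submission */
	uint64_t start;		/* first byte of the zone */
	uint64_t length;	/* zone size in bytes */
	uint64_t alloc_offset;	/* allocation pointer, relative to start */
	uint64_t used;		/* bytes still referenced */
};

static void zoned_bg_init(struct zoned_bg *bg, uint64_t start, uint64_t length)
{
	pthread_mutex_init(&bg->lock, NULL);
	bg->start = start;
	bg->length = length;
	bg->alloc_offset = 0;
	bg->used = 0;
}

/*
 * Allocate len bytes at the allocation pointer and call submit (if any)
 * before dropping the lock, so that writes are issued in allocation order.
 * Returns false when there is no room left past the allocation pointer.
 */
static bool zoned_bg_alloc_and_submit(struct zoned_bg *bg, uint64_t len,
				      void (*submit)(uint64_t addr, uint64_t len),
				      uint64_t *addr_ret)
{
	bool ok = false;

	pthread_mutex_lock(&bg->lock);
	if (len <= bg->length - bg->alloc_offset) {
		uint64_t addr = bg->start + bg->alloc_offset;

		bg->alloc_offset += len;
		bg->used += len;
		if (submit)
			submit(addr, len);
		*addr_ret = addr;
		ok = true;
	}
	pthread_mutex_unlock(&bg->lock);
	return ok;
}

/* Freeing never moves the pointer back: space below it is not reused. */
static void zoned_bg_free(struct zoned_bg *bg, uint64_t len)
{
	pthread_mutex_lock(&bg->lock);
	assert(len <= bg->used);
	bg->used -= len;
	pthread_mutex_unlock(&bg->lock);
}

/* Model of a zone reset: only allowed once nothing in the group is used. */
static bool zoned_bg_reset_if_unused(struct zoned_bg *bg)
{
	bool reset = false;

	pthread_mutex_lock(&bg->lock);
	if (bg->used == 0) {
		bg->alloc_offset = 0;
		reset = true;
	}
	pthread_mutex_unlock(&bg->lock);
	return reset;
}
```

In this model, freeing an extent in the middle of the zone does not make
that space allocatable again. Only zoned_bg_reset_if_unused() rewinds the
pointer, mirroring the rule that a zone is reset only when its block
group is unused. Holding the lock across the submit callback is the
simplification that stands in for serializing allocation and submit_bio.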