support for zoned devices RFCv2

Hi all,

this series adds support for zoned devices:

    https://zonedstorage.io/docs/introduction/zoned-storage

to XFS. It has been developed for and tested on both SMR hard drives,
which are the oldest and most common class of zoned devices:

   https://zonedstorage.io/docs/introduction/smr

and ZNS SSDs:

   https://zonedstorage.io/docs/introduction/zns

It has not been tested with zoned UFS devices, as their current capacity
points and performance characteristics aren't too interesting for XFS
use cases (but never say never).

Sequential-write-only zones are supported only for data, using a new
allocator for the RT device that maps each zone to an rtgroup which is
written sequentially.  All metadata and (for now) the log require
randomly writable space.  This means a realtime device is required to
support zoned storage.  For the common case of SMR hard drives that
contain both randomly writable zones and sequential-write-required zones
on the same block device, the concept of an internal RT device is added,
which means using XFS on an SMR HDD is as simple as:

$ mkfs.xfs /dev/sda
$ mount /dev/sda /mnt
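
Before running mkfs it can help to check which zone model the block
layer reports for a device.  A minimal sketch, assuming the kernel
exposes the model in the queue/zoned sysfs attribute; the zone_model
helper and the SYSFS_ROOT override are made up for illustration and are
not part of this series:

```shell
# Print the zone model the block layer reports for a whole-disk device.
# SYSFS_ROOT is a hypothetical override so the helper can be pointed at
# a fake sysfs tree; it defaults to /sys.
zone_model() {
    dev=${1##*/}        # strip the directory part: /dev/sda -> sda
    attr="${SYSFS_ROOT:-/sys}/block/$dev/queue/zoned"
    if [ -r "$attr" ]; then
        cat "$attr"
    else
        # No such attribute, e.g. a partition or a missing device.
        echo "unknown"
        return 1
    fi
}
```

Running "zone_model /dev/sda" prints host-managed for a host-managed
SMR drive and none for a conventional device.  "blkzone report /dev/sda"
from util-linux lists the individual zones and their types.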

When using NVMe ZNS SSDs that do not support conventional zones, the
traditional multi-device RT configuration is required.  E.g. for an
SSD with a conventional namespace 1 and a zoned namespace 2:

$ mkfs.xfs /dev/nvme0n1 -r rtdev=/dev/nvme0n2
$ mount -o rtdev=/dev/nvme0n2 /dev/nvme0n1 /mnt

The zoned allocator can also be used on conventional block devices, or
on conventional zones (e.g. when using an SMR HDD as the external RT
device).  For example using zoned XFS on normal SSDs shows very nice
performance advantages and write amplification reduction for intelligent
workloads like RocksDB.

Some work is still in progress or planned, but should not affect the
integration with the rest of XFS or the on-disk format:

 - support for quotas
 - support for reflinks - the I/O path already supports them, but
   garbage collection currently isn't refcount aware and would unshare
   them, rendering the feature useless
 - more scalable garbage collection victim selection
 - various improvements to hint based data placement

And probably a lot more once we get review feedback.

Note that right now Darrick is seeing issues with an off-by-one
sb_frextents after some of the prep patches.  We (well, mostly him, as
he can actually reproduce it) have been furiously debugging it and
suspect it's just timing changes in the series that cause it to happen.
We'd still like to root it out, but I'd like to kick off another round
of review in the meantime.

To make testing easier a git tree is provided that has the iomap series,
this code and a few misc patches that make VM testing easier:

    git://git.infradead.org/users/hch/xfs.git xfs-zoned

The matching xfsprogs is available here:

    git://git.infradead.org/users/hch/xfsprogs.git xfs-zoned

An xfstests branch that enables the zoned code and adds various new
tests is here:

    git://git.infradead.org/users/hch/xfstests-dev.git xfs-zoned

An updated xfs-documentation branch documenting the on-disk format is
here:

    git://git.infradead.org/users/hch/xfs-documentation.git xfs-zoned

Gitweb:

    http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-zoned
    http://git.infradead.org/users/hch/xfsprogs.git/shortlog/refs/heads/xfs-zoned
    http://git.infradead.org/users/hch/xfstests-dev.git/shortlog/refs/heads/xfs-zoned
    http://git.infradead.org/users/hch/xfs-documentation.git/shortlog/refs/heads/xfs-zoned

Changes since RFC:
 - rebased to Linus' current tree, which has rtrmap and rtreflink merged
 - adjust for minor changes in the iomap series
 - add one more caller of rtg_rmap
 - comment on the sb_dblocks access in statfs
 - use xfs_inode_alloc_unitsize to report dio alignments
 - improve various commit messages
 - misc spelling fixes
 - misc whitespace fixes
 - add separate helpers for raw vs always positive free space counters
 - print the pool name when reservations failed
 - return bool from xfs_zone_validate
 - use more rtg locking helpers
 - use more XFS_IS_CORRUPT
 - misc cleanups and minor renames
 - document the XFS_ZR_* constants
 - rename the IN_GC flag
 - make gc_bio.state an enum
 - don't join rtg to empty transaction in xfs_zone_gc_query
 - update copyrights
 - better inode and sb verifiers
 - allocate GC thread specific data outside the thread
 - clean up GC naming and add more comments
 - use the cmp_int trick
 - rework zone list locking a bit to avoid kmallocing under a spinlock
 - export rtstart in the fsgeometry in fsblocks
 - use buckets to speed up GC victim selection
 - stop the GC thread when freezing the file system
 - drop an assert that was racy
 - move some code additions between patches in the series
 - keep an active open zone reference for outstanding I/O
 - correctly handle at mount time the case of all available open zones
   being in use plus an open GC zone at shutdown
 - reject zoned specific mount options for non-zoned file systems
 - export the max_open_zones limit in sysfs
 - add freecounter tracing
 - reduce metafile reservations
 - fix GC I/O splitting on devices whose max_sectors is not LBA aligned



