On 1/15/20 7:28 AM, Damien Le Moal wrote:
> zonefs is a very simple file system exposing each zone of a zoned block
> device as a file. Unlike a regular file system with zoned block device
> support (e.g. f2fs), zonefs does not hide the sequential write
> constraint of zoned block devices from the user. Files representing
> sequential write zones of the device must be written sequentially
> starting from the end of the file (append only writes).
>
> As such, zonefs is in essence closer to a raw block device access
> interface than to a full featured POSIX file system. The goal of zonefs
> is to simplify the implementation of zoned block device support in
> applications by replacing raw block device file accesses with a richer
> file API, avoiding relying on direct block device file ioctls which may
> be more obscure to developers. One example of this approach is the
> implementation of LSM (log-structured merge) tree structures (such as
> used in RocksDB and LevelDB) on zoned block devices by allowing SSTables
> to be stored in a zone file similarly to a regular file system rather
> than as a range of sectors of a zoned device. The introduction of the
> higher level construct "one file is one zone" can help reduce the
> amount of changes needed in the application as well as introduce
> support for different application programming languages.
>
> Zonefs on-disk metadata is reduced to an immutable super block to
> persistently store a magic number and optional feature flags and
> values. On mount, zonefs uses blkdev_report_zones() to obtain the device
> zone configuration and populates the mount point with a static file tree
> solely based on this information. E.g. file sizes come from the device
> zone type and write pointer offset managed by the device itself.
>
> The zone files created on mount have the following characteristics.
> 1) Files representing zones of the same type are grouped together
>    under a common sub-directory:
>      * For conventional zones, the sub-directory "cnv" is used.
>      * For sequential write zones, the sub-directory "seq" is used.
>    These two directories are the only directories that exist in zonefs.
>    Users cannot create other directories and cannot rename or delete
>    the "cnv" and "seq" sub-directories.
> 2) The name of zone files is the number of the file within the zone
>    type sub-directory, in order of increasing zone start sector.
> 3) The size of conventional zone files is fixed to the device zone size.
>    Conventional zone files cannot be truncated.
> 4) The size of sequential zone files represents the file's zone write
>    pointer position relative to the zone start sector. Truncating these
>    files is allowed only down to 0, in which case the zone is reset to
>    rewind the zone write pointer position to the start of the zone, or
>    up to the zone size, in which case the file's zone is transitioned
>    to the FULL state (finish zone operation).
> 5) Read and write operations to files are not allowed beyond the
>    file zone size. Any access exceeding the zone size fails with
>    the -EFBIG error.
> 6) Creating, deleting, renaming or modifying any attribute of files and
>    sub-directories is not allowed.
> 7) There are no restrictions on the type of read and write operations
>    that can be issued to conventional zone files. Buffered, direct and
>    mmap read & write operations are accepted. For sequential zone files,
>    there are no restrictions on read operations, but all write
>    operations must be direct IO append writes. mmap write of sequential
>    files is not allowed.
>
> Several optional features of zonefs can be enabled at format time.
>   * Conventional zone aggregation: ranges of contiguous conventional
>     zones can be aggregated into a single larger file instead of the
>     default one file per zone.
>   * File ownership: The owner UID and GID of zone files is by default 0
>     (root) but can be changed to any valid UID/GID.
>   * File access permissions: the default 640 access permissions can be
>     changed.
>
> The mkzonefs tool is used to format zoned block devices for use with
> zonefs. This tool is available on GitHub at:
>
> git@xxxxxxxxxx:damien-lemoal/zonefs-tools.git.
>
> zonefs-tools also includes a test suite which can be run against any
> zoned block device, including a null_blk block device created in zoned
> mode.
>
> Example: the following formats a 15TB host-managed SMR HDD with 256 MB
> zones with the conventional zones aggregation feature enabled.
>
> $ sudo mkzonefs -o aggr_cnv /dev/sdX
> $ sudo mount -t zonefs /dev/sdX /mnt
> $ ls -l /mnt/
> total 0
> dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
> dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq
>
> The size of the zone files sub-directories indicates the number of files
> existing for each type of zones. In this example, there is only one
> conventional zone file (all conventional zones are aggregated under a
> single file).
>
> $ ls -l /mnt/cnv
> total 137101312
> -rw-r----- 1 root root 140391743488 Nov 25 13:23 0
>
> This aggregated conventional zone file can be used as a regular file.
>
> $ sudo mkfs.ext4 /mnt/cnv/0
> $ sudo mount -o loop /mnt/cnv/0 /data
>
> The "seq" sub-directory grouping files for sequential write zones has
> in this example 55356 zones.
>
> $ ls -lv /mnt/seq
> total 14511243264
> -rw-r----- 1 root root 0 Nov 25 13:23 0
> -rw-r----- 1 root root 0 Nov 25 13:23 1
> -rw-r----- 1 root root 0 Nov 25 13:23 2
> ...
> -rw-r----- 1 root root 0 Nov 25 13:23 55354
> -rw-r----- 1 root root 0 Nov 25 13:23 55355
>
> For sequential write zone files, the file size changes as data is
> appended at the end of the file, similarly to any regular file system.
>
> $ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct
> 1+0 records in
> 1+0 records out
> 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s
>
> $ ls -l /mnt/seq/0
> -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
>
> The written file can be truncated to the zone size, preventing any
> further write operation.
>
> $ truncate -s 268435456 /mnt/seq/0
> $ ls -l /mnt/seq/0
> -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0
>
> Truncation to 0 size allows freeing the file zone storage space and
> restarting append writes to the file.
>
> $ truncate -s 0 /mnt/seq/0
> $ ls -l /mnt/seq/0
> -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
>
> Since files are statically mapped to zones on the disk, the number of
> blocks of a file as reported by stat() and fstat() indicates the size
> of the file zone.
>
> $ stat /mnt/seq/0
>   File: /mnt/seq/0
>   Size: 0          Blocks: 524288     IO Block: 4096   regular empty file
> Device: 870h/2160d Inode: 50431       Links: 1
> Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2019-11-25 13:23:57.048971997 +0900
> Modify: 2019-11-25 13:52:25.553805765 +0900
> Change: 2019-11-25 13:52:25.553805765 +0900
>  Birth: -
>
> The number of blocks of the file ("Blocks") in units of 512B blocks
> gives the maximum file size of 524288 * 512 B = 256 MB, corresponding
> to the device zone size in this example. Note that the "IO Block"
> field always indicates the minimum IO size for writes and corresponds
> to the device physical sector size.
>
> This code contains contributions from:
> * Johannes Thumshirn <jthumshirn@xxxxxxx>,
> * Darrick J. Wong <darrick.wong@xxxxxxxxxx>,
> * Christoph Hellwig <hch@xxxxxx>,
> * Chaitanya Kulkarni <chaitanya.kulkarni@xxxxxxx> and
> * Ting Yao <tingyao@xxxxxxxxxxx>.
>
> Signed-off-by: Damien Le Moal <damien.lemoal@xxxxxxx>
> Reviewed-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@xxxxxxx>
> ---
>  MAINTAINERS                |    9 +
>  fs/Kconfig                 |    1 +
>  fs/Makefile                |    1 +
>  fs/zonefs/Kconfig          |    9 +
>  fs/zonefs/Makefile         |    4 +
>  fs/zonefs/super.c          | 1177 ++++++++++++++++++++++++++++++++++++
>  fs/zonefs/zonefs.h         |  175 ++++++
>  include/uapi/linux/magic.h |    1 +
>  8 files changed, 1377 insertions(+)
>  create mode 100644 fs/zonefs/Kconfig
>  create mode 100644 fs/zonefs/Makefile
>  create mode 100644 fs/zonefs/super.c
>  create mode 100644 fs/zonefs/zonefs.h
>
Reviewed-by: Hannes Reinecke <hare@xxxxxxx>

Cheers,

Hannes
--
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@xxxxxxx                               +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
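
For readers following the thread, the sequential zone file rules in the
quoted description (append-only writes, -EFBIG past the zone size,
truncation only to 0 or to the zone size) can be sketched as a small
in-memory model. This is an illustration in Python, not code from the
patch: the names SeqZoneFile and zone_size_from_blocks are hypothetical,
and raising ValueError for rejected writes and truncations is an
assumption, since the description does not say which error those cases
return.

```python
import errno

SECTOR_SIZE = 512  # stat() reports "Blocks" in 512-byte units


def zone_size_from_blocks(blocks: int) -> int:
    """Maximum file size implied by the "Blocks" value of a zone file."""
    return blocks * SECTOR_SIZE


class SeqZoneFile:
    """Toy in-memory model of one sequential zone file (not kernel code)."""

    def __init__(self, zone_size: int) -> None:
        self.zone_size = zone_size
        self.size = 0  # mirrors the zone write pointer offset

    def write(self, offset: int, length: int) -> int:
        # Writes must land exactly at the current end of file (append only).
        # ValueError is a placeholder; the real error code is not stated.
        if offset != self.size:
            raise ValueError("write is not an append at the current file size")
        # Any access past the zone size fails with EFBIG, per the description.
        if offset + length > self.zone_size:
            raise OSError(errno.EFBIG, "write exceeds the zone size")
        self.size += length
        return length

    def truncate(self, new_size: int) -> None:
        # Only two sizes are accepted: 0 (zone reset) or zone size (finish).
        if new_size not in (0, self.zone_size):
            raise ValueError("truncate only to 0 or to the zone size")
        self.size = new_size


if __name__ == "__main__":
    zone_size = zone_size_from_blocks(524288)  # value from the stat example
    f = SeqZoneFile(zone_size)
    f.write(0, 4096)       # like the dd example: file size becomes 4096
    f.truncate(zone_size)  # zone finish: no further write fits
    f.truncate(0)          # zone reset: append writes can restart
    print(zone_size, f.size)
```

With the 524288 blocks reported by stat in the quoted example,
zone_size_from_blocks() gives 268435456 bytes, which is the 256 MB zone
size passed to truncate above.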