On 2019/12/16 17:19, Enrico Weigelt, metux IT consult wrote:
> On 12.12.19 19:38, Damien Le Moal wrote:
>
> Hi,
>
>> zonefs is a very simple file system exposing each zone of a zoned block
>> device as a file. Unlike a regular file system with zoned block device
>> support (e.g. f2fs or the on-going btrfs effort), zonefs does not hide
>> the sequential write constraint of zoned block devices to the user.
>
> Just curious: what's the exact definition of "zoned" here ?
> Something like partitions ?

As Carlos commented already, a zoned block device is a Linux abstraction used
to handle SMR HDDs (Shingled Magnetic Recording). These disks expose an LBA
range that is divided into zones. For host-managed models, these zones can
only be written sequentially. Other models such as host-aware or drive-managed
allow random writes to all zones at the cost of potentially serious
performance degradation due to disk internal garbage collection of zones
(similar to how an SSD handles erase blocks).

While today zoned block devices exist on the market only in the form of SMR
disks, NVMe SSDs will also soon be available with the completion of the Zoned
Namespace specifications.

Zoning of block devices has several advantages: higher capacities for HDDs and
more predictable and lower IO latencies for SSDs (almost no internal GC/wear
leveling needed). But taking full advantage of these devices requires software
changes on the host due to the sequential write constraint imposed by the
device interface.

> Can these files then also serve as block devices for other filesystems ?
> Just a funny idea: could we handle partitions by a file system ?
>
> Even more funny idea: give file systems block device ops, so they can
> be directly used as such (w/o explicitly using loopdev) ;-)

This is outside the scope of this thread, so let's not start a discussion
about this here. Start a new thread!

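To make the sequential write constraint described above a little more concrete, here is a minimal in-memory sketch of a single host-managed zone. It is only a toy model: the class name `SequentialZone` and all of its methods are hypothetical and are not part of zonefs or of any kernel interface. It assumes the usual write pointer model, where a write is accepted only at the current write pointer, reads are allowed anywhere below it, and a reset rewinds the pointer to the start of the zone.

```python
class SequentialZone:
    """Toy model of one sequential-write-required zone (hypothetical names)."""

    def __init__(self, start_lba, num_blocks):
        self.start_lba = start_lba        # first LBA of the zone
        self.num_blocks = num_blocks      # zone size in blocks
        self.write_pointer = start_lba    # next LBA that may be written
        self._blocks = []                 # data written so far, in order

    def write(self, lba, block):
        # A write is only accepted at the current write pointer position.
        if lba != self.write_pointer:
            raise ValueError(
                f"unaligned write: lba {lba}, write pointer {self.write_pointer}"
            )
        if len(self._blocks) >= self.num_blocks:
            raise ValueError("zone is full")
        self._blocks.append(block)
        self.write_pointer += 1

    def read(self, lba):
        # Reads can be random, but only within the already written range.
        if not (self.start_lba <= lba < self.write_pointer):
            raise ValueError(f"lba {lba} has not been written")
        return self._blocks[lba - self.start_lba]

    def reset(self):
        # Resetting discards the zone data and rewinds the write pointer.
        self._blocks.clear()
        self.write_pointer = self.start_lba


if __name__ == "__main__":
    zone = SequentialZone(start_lba=1000, num_blocks=4)
    zone.write(1000, b"first")
    zone.write(1001, b"second")
    print(zone.read(1000), zone.write_pointer)
    try:
        zone.write(1003, b"skips an LBA")  # not at the write pointer
    except ValueError as err:
        print("rejected:", err)
```

A real device enforces this in hardware or firmware and works on sectors rather than Python objects, so this sketch only illustrates why software written for random writes needs changes.
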
>> Files representing sequential write zones of the device must be written
>> sequentially starting from the end of the file (append only writes).
>
> So, these files can only be accessed like a tape ?

Writes must be sequential within a zone but reads can be random to any written
LBA.

> Assuming you're working ontop of standard block devices anyways (instead
> of tape-like media ;-)) - why introducing such a limitation ?

See above: the limitation is physical and imposed by the device, so that
different improvements can be achieved depending on the storage medium being
used (increased capacity, lower latencies, lower over provisioning, etc).

>
>> zonefs is not a POSIX compliant file system. It's goal is to simplify
>> the implementation of zoned block devices support in applications by
>> replacing raw block device file accesses with a richer file based API,
>> avoiding relying on direct block device file ioctls which may
>> be more obscure to developers.
>
> ioctls ?
>
> Last time I checked, block devices could be easily accessed via plain
> file ops (read, write, seek, ...). You can basically treat them just
> like big files of fixed size.

I was not clear, my apologies. I am referring here to the zoned block device
related ioctls defined in include/uapi/linux/blkzoned.h. These ioctls allow an
application to manage the device zones (obtain zone information, erase zones,
etc). They cause zone related commands to be issued to the device. These
commands are defined by the ZBC and ZAC standards for SCSI and ATA, and will
be defined by NVMe Zoned Namespace in the very near future.

>> One example of this approach is the
>> implementation of LSM (log-structured merge) tree structures (such as
>> used in RocksDB and LevelDB)
>
> The same LevelDB as used eg. in Chrome browser, which destroys itself
> every time a little temporary problem (eg. disk full) occours ?
> If that's the usecase I'd rather use an simple in-memory table instead
> and and enough swap, as leveldb isn't reliable enough for persistent
> data anyways :p

The intent of my comment was not to advocate for or discuss the merits of any
particular KV implementation. I was only pointing out that zonefs does not
come in a void: we do have use cases for it and did the work on some user
space software to validate it. LevelDB and RocksDB are the two LSM-tree based
KV stores we worked on as they are very popular and widely used.

>> on zoned block devices by allowing SSTables
>> to be stored in a zone file similarly to a regular file system rather
>> than as a range of sectors of a zoned device. The introduction of the
>> higher level construct "one file is one zone" can help reducing the
>> amount of changes needed in the application while at the same time
>> allowing the use of zoned block devices with various programming
>> languages other than C.
>
> Why not just simply use files on a suited filesystem (w/ low block io
> overhead) or LVM volumes ?

Using a file system compliant with zoned block device constraints such as f2fs
or btrfs (on-going work) is certainly a valid approach. However, this may not
be the optimal one if the application has a mostly sequential write behavior.
LSM-tree based KV stores fall into this category: SSTables are large (several
MB) and always written sequentially. There are no random writes, which makes
it easier to support zoned block devices directly, without the need for a file
system which would add a GC background process and degrade performance. As
mentioned in the cover letter, the goal of zonefs is to facilitate the
implementation of this support compared to a pure raw block device use.

>
>
> --mtx
>

--
Damien Le Moal
Western Digital Research