On 2019/07/13 2:10, Viacheslav Dubeyko wrote:
> On Fri, 2019-07-12 at 12:00 +0900, Damien Le Moal wrote:
>> zonefs is a very simple file system exposing each zone of a zoned
>> block device as a file. This is intended to simplify implementation
>
> As far as I can see, a zone usually is pretty big (for example,
> 256MB). But [1, 2] showed that about 60% of files on a file system
> volume have a size of about 4KB - 128KB. Also, [3] showed that modern
> applications use very complex file structures that are updated in
> random order. Moreover, [4] showed that 90% of all files are not used
> after initial creation, that those that are used are normally
> short-lived, and that if a file is not used in some manner the day
> after it is created, it will probably never be used; 1% of all files
> are used daily.
>
> It sounds to me like this approach will mostly lead to a waste of
> zone space. Also, the need to update the data of the same file will
> result in frequently moving file data from one zone to another. If we
> are talking about SSDs, then that sounds like a quick and easy way to
> kill the device fast.
>
> Do you have in mind some special use-case?

As the commit message mentions, zonefs is not a traditional file system
by any means and is much closer to a raw block device access interface
than anything else. This is the entire point of this exercise: allow
replacing raw block device accesses with the easier to use file system
API.

Raw block device access also goes through the file API, so one could
argue that this is nonsense. What I mean here is that by abstracting
zones with files, the user does not need to do zone configuration
discovery with ioctl(BLKREPORTZONES), does not need to do explicit zone
resets with ioctl(BLKRESETZONE), and does not have to implement the
"start from one sector and write sequentially from there" management
for write() calls (i.e. seeks), etc. This is all replaced with the file
abstraction: the directory entry list replaces zone information,
truncate() replaces zone reset, and the file's current position
replaces the application's zone write pointer management (see the
first sketch at the end of this email).

This simplifies adding zoned block device support to applications, but
only in cases where said applications:
1) operate with large files, and
2) have no, or only minimal, need for random writes.

A perfect match for this, as mentioned in the commit message, are
LSM-tree based applications such as LevelDB or RocksDB. Another
related example is the Bluestore distributed object store, which uses
RocksDB but still has a bluefs layer that could be replaced with
zonefs.

As an illustration of this, Ting Yao of Huazhong University of Science
and Technology (China) and her team modified LevelDB to work with
zonefs. The early prototype code is on github here:

https://github.com/PDS-Lab/GearDB/tree/zonefs

LSM-tree applications typically operate on large files, in the same
range as zoned block device zone sizes (e.g. 256 MB or so). While the
maximum file size is generally a parameter that can be changed, using
zonefs on a zoned block device forces using the zone size as the
SSTable file maximum size (see the second sketch at the end of this
email). This can have an impact on the DB performance depending on the
device type, but that is another discussion. The point here is the
code simplifications that zonefs allows.

For more general purpose use cases (small files, lots of random
modifications), we already have the dm-zoned device mapper and f2fs
support, and we are also currently working on btrfs support. These
solutions are, in my opinion, more appropriate than zonefs to address
the points you raised.
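To make the zone-to-file mapping concrete, here is a rough, untested
sketch of the same "reset a zone, then rewrite it" operation done both
ways. The device path and the zone file path are illustrative
assumptions, not naming guarantees, and error handling is kept minimal
on purpose:

/*
 * Sketch only: "/dev/sdX" and the zonefs file path are assumptions
 * for illustration.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

static char buf[4096] __attribute__((aligned(4096)));

/* Raw zoned block device: the application discovers zones, resets
 * them explicitly and tracks the zone write pointer itself. */
static int update_zone_raw(const char *dev)
{
	struct blk_zone_report *rep;
	struct blk_zone_range range;
	int fd, ret = -1;

	fd = open(dev, O_RDWR | O_DIRECT);
	if (fd < 0)
		return -1;

	/* Zone configuration discovery (first zone only here). */
	rep = calloc(1, sizeof(*rep) + sizeof(struct blk_zone));
	rep->sector = 0;
	rep->nr_zones = 1;
	if (ioctl(fd, BLKREPORTZONES, rep))
		goto out;

	/* Explicit zone reset before rewriting. */
	range.sector = rep->zones[0].start;
	range.nr_sectors = rep->zones[0].len;
	if (ioctl(fd, BLKRESETZONE, &range))
		goto out;

	/* The write must land exactly at the zone write pointer
	 * (wp is in 512 B sectors, hence the shift). */
	if (pwrite(fd, buf, sizeof(buf), rep->zones[0].wp << 9) < 0)
		goto out;
	ret = 0;
out:
	free(rep);
	close(fd);
	return ret;
}

/* The same update through zonefs: truncate() is the zone reset and
 * the file size/offset is the write pointer. */
static int update_zone_zonefs(const char *zonefile)
{
	int fd = open(zonefile, O_WRONLY);

	if (fd < 0)
		return -1;
	ftruncate(fd, 0);		/* zone reset */
	write(fd, buf, sizeof(buf));	/* sequential append at EOF */
	close(fd);
	return 0;
}

The second function is essentially all the zone state management an
application has to carry; the file system takes care of discovery,
reset semantics and write pointer tracking.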
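And to illustrate the SSTable size point: assuming a LevelDB recent
enough to expose the max_file_size option through its C API, pinning
the table size to the zone size is a one-parameter change. Note that a
stock LevelDB would still need modifications such as Ting Yao's to
actually run on zonefs; this only shows where the zone size constraint
surfaces:

#include <stddef.h>
#include <stdio.h>
#include <leveldb/c.h>

/* Open a DB with SSTables capped at the device zone size, so that one
 * table maps onto one zone file (e.g. zone_size = 256 MB). */
static leveldb_t *open_db(const char *path, size_t zone_size)
{
	leveldb_options_t *opts = leveldb_options_create();
	char *err = NULL;
	leveldb_t *db;

	leveldb_options_set_create_if_missing(opts, 1);
	leveldb_options_set_max_file_size(opts, zone_size);

	db = leveldb_open(opts, path, &err);
	leveldb_options_destroy(opts);
	if (err) {
		fprintf(stderr, "leveldb_open: %s\n", err);
		leveldb_free(err);
		return NULL;
	}
	return db;
}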
Best regards.

-- 
Damien Le Moal
Western Digital Research