On 2019/07/13 2:10, Viacheslav Dubeyko wrote:
> On Fri, 2019-07-12 at 12:00 +0900, Damien Le Moal wrote:
>> zonefs is a very simple file system exposing each zone of a zoned
>> block device as a file. This is intended to simplify implementation
>
> As far as I can see, a zone usually is pretty big (for example,
> 256MB). But [1, 2] showed that about 60% of files on a file system
> volume have a size of about 4KB - 128KB. Also, [3] showed that modern
> applications use very complex file structures that are updated in
> random order. Moreover, [4] showed that 90% of all files are not used
> after initial creation, that those that are used are normally
> short-lived, and that if a file is not used in some manner the day
> after it is created, it will probably never be used; 1% of all files
> are used daily.
>
> It sounds to me like this approach will mostly lead to a waste of
> zone space. Also, the need to update the data of the same file will
> result in frequently moving file data from one zone to another. If we
> are talking about SSDs, then that sounds like a quick and easy way to
> kill the device fast.
>
> Do you have in mind some special use-case?

As the commit message mentions, zonefs is not a traditional file system
by any means and is much closer to a raw block device access interface
than anything else. This is the entire point of this exercise: allow
replacing raw block device accesses with the easier to use file system
API.

Raw block device access also goes through the file API, so one could
argue that this is nonsense. What I mean here is that by abstracting
zones with files, the user does not need to do zone configuration
discovery with ioctl(BLKREPORTZONES), does not need to do explicit zone
resets with ioctl(BLKRESETZONE), and does not have to implement the
"start from one sector and write sequentially from there" management
for write() calls (i.e. seeks), etc. This is all replaced with the file
abstraction: the directory entry list replaces zone information,
truncate() replaces zone reset, and the file's current position
replaces the application's zone write pointer management (see the
first sketch at the end of this email).

This simplifies adding zoned block device support to applications, but
only in cases where said applications:
1) operate with large files, and
2) have no, or only minimal, need for random writes.

A perfect match for this, as mentioned in the commit message, are
LSM-tree based applications such as LevelDB or RocksDB. Another
related example is the Bluestore distributed object store, which uses
RocksDB but still has a bluefs layer that could be replaced with
zonefs.

As an illustration of this, Ting Yao of Huazhong University of Science
and Technology (China) and her team modified LevelDB to work with
zonefs. The early prototype code is on github here:

https://github.com/PDS-Lab/GearDB/tree/zonefs

LSM-tree applications typically operate on large files, in the same
range as zoned block device zone sizes (e.g. 256 MB or so). While the
maximum file size is generally a parameter that can be changed, using
zonefs on a zoned block device forces using the zone size as the
SSTable file maximum size (see the second sketch at the end of this
email). This can have an impact on the DB performance depending on the
device type, but that is another discussion. The point here is the
code simplifications that zonefs allows.

For more general purpose use cases (small files, lots of random
modifications), we already have the dm-zoned device mapper and f2fs
support, and we are also currently working on btrfs support. These
solutions are, in my opinion, more appropriate than zonefs to address
the points you raised.
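To make the zone-to-file mapping concrete, here is a rough, untested
sketch of the same "reset a zone, then rewrite it" operation done both
ways. The device path and the zone file path are illustrative
assumptions, not naming guarantees, and error handling is kept minimal
on purpose:

/*
 * Sketch only: "/dev/sdX" and the zonefs file path are assumptions
 * for illustration.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

static char buf[4096] __attribute__((aligned(4096)));

/* Raw zoned block device: the application discovers zones, resets
 * them explicitly and tracks the zone write pointer itself. */
static int update_zone_raw(const char *dev)
{
	struct blk_zone_report *rep;
	struct blk_zone_range range;
	int fd, ret = -1;

	fd = open(dev, O_RDWR | O_DIRECT);
	if (fd < 0)
		return -1;

	/* Zone configuration discovery (first zone only here). */
	rep = calloc(1, sizeof(*rep) + sizeof(struct blk_zone));
	rep->sector = 0;
	rep->nr_zones = 1;
	if (ioctl(fd, BLKREPORTZONES, rep))
		goto out;

	/* Explicit zone reset before rewriting. */
	range.sector = rep->zones[0].start;
	range.nr_sectors = rep->zones[0].len;
	if (ioctl(fd, BLKRESETZONE, &range))
		goto out;

	/* The write must land exactly at the zone write pointer
	 * (wp is in 512 B sectors, hence the shift). */
	if (pwrite(fd, buf, sizeof(buf), rep->zones[0].wp << 9) < 0)
		goto out;
	ret = 0;
out:
	free(rep);
	close(fd);
	return ret;
}

/* The same update through zonefs: truncate() is the zone reset and
 * the file size/offset is the write pointer. */
static int update_zone_zonefs(const char *zonefile)
{
	int fd = open(zonefile, O_WRONLY);

	if (fd < 0)
		return -1;
	ftruncate(fd, 0);		/* zone reset */
	write(fd, buf, sizeof(buf));	/* sequential append at EOF */
	close(fd);
	return 0;
}

The second function is essentially all the zone state management an
application has to carry; the file system takes care of discovery,
reset semantics and write pointer tracking.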
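And to illustrate the SSTable size point: assuming a LevelDB recent
enough to expose the max_file_size option through its C API, pinning
the table size to the zone size is a one-parameter change. Note that a
stock LevelDB would still need modifications such as Ting Yao's to
actually run on zonefs; this only shows where the zone size constraint
surfaces:

#include <stddef.h>
#include <stdio.h>
#include <leveldb/c.h>

/* Open a DB with SSTables capped at the device zone size, so that one
 * table maps onto one zone file (e.g. zone_size = 256 MB). */
static leveldb_t *open_db(const char *path, size_t zone_size)
{
	leveldb_options_t *opts = leveldb_options_create();
	char *err = NULL;
	leveldb_t *db;

	leveldb_options_set_create_if_missing(opts, 1);
	leveldb_options_set_max_file_size(opts, zone_size);

	db = leveldb_open(opts, path, &err);
	leveldb_options_destroy(opts);
	if (err) {
		fprintf(stderr, "leveldb_open: %s\n", err);
		leveldb_free(err);
		return NULL;
	}
	return db;
}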
Best regards.

-- 
Damien Le Moal
Western Digital Research