Re: [PATCH RFC] fs: New zonefs file system

Damien Le Moal <Damien.LeMoal@xxxxxxx> · Sat, 20 Jul 2019 01:07:25 +0000

On 2019/07/19 23:25, Jeff Moyer wrote:
> Hi, Damien,
> 
> Thanks for your well-considered response.
> 
> Damien Le Moal <Damien.LeMoal@xxxxxxx> writes:
> 
>> Jeff,
>>
>> On 2019/07/18 23:11, Jeff Moyer wrote:
>>> Hi, Damien,
>>>
>>> Did you consider creating a shared library?  I bet that would also
>>> ease application adoption for the use cases you're interested in, and
>>> would have similar performance.
>>>
>>> -Jeff
>>
>> Yes, it would, but to a lesser extent since system calls would need to be
>> replaced with library calls. Earlier work on LevelDB by Ting used the library
>> approach with libzbc, not quite a "libzonefs" but close enough. Working with
>> LevelDB code gave me the idea for zonefs. Compared to a library, the added
>> benefits are that specific language bindings are not a problem and further
>> simplify the code changes needed to support zoned block devices. In the case of
>> LevelDB for instance, C++ is used and file accesses are using streams, which
>> makes using a library a little difficult, and necessitates more changes just for
>> the internal application API itself. The needed changes spread beyond the device
>> access API.
>>
>> This is I think the main advantage of this simple in-kernel FS over a library:
>> the developer can focus on zone block device specific needs (write sequential
>> pattern and garbage collection) and forget about the device access parts as the
>> standard system calls API can be used.
> 
> OK, I can see how a file system eases adoption across multiple
> languages, and may, in some cases, be easier to adopt by applications.
> However, I'm not a fan of the file system interface for this usage.
> Once you present a file system, there are certain expectations from
> users, and this fs breaks most of them.

Yes, that is true. zonefs differs significantly from regular file systems. But I
would argue that breaking user expectations is OK because that would happen if
and only if the user does not understand the hardware they are dealing with. I
still regularly get emails about mkfs.ext4 not working with SMR drives :)
In other words, since kernel 4.10 exposed host-managed (HM) SMR HDDs as regular
block device files, we have in a sense already been breaking legacy user
expectations about the devices behind device files... So I am not too worried
about this point.

If zonefs makes it into the kernel, I will probably be getting more emails about
"it does not work!" until SMR drive users out there understand what they are
dealing with. We are making a serious effort to document everything related to
zoned block devices. See https://zonedstorage.io. zonefs, if included in the
kernel, will be part of that documentation effort. Note that this documentation
is external to the kernel; we still need to increase our contribution to the
kernel docs for zoned block devices. And we will.

> I'll throw out another suggestion that may or may not work (I haven't
> given it much thought).  Would it be possible to create a device mapper
> target that would export each zone as a separate block device?  I
> understand that wouldn't help with the write pointer management, but it
> would allow you to create a single "file" for each zone.

Well, I do not think you need a new device mapper target for this. dm-linear
supports zoned block devices and will happily allow mapping a single zone and
exposing a block device file for it. My problem with this approach is that SMR
drives are huge, and getting bigger. A 15 TB drive has 55380 zones of 256 MB.
Upcoming 20 TB drives have more than 75000 zones. Using dm-linear or any
per-zone device mapper target would create huge resource pressure: the memory
used per zone alone would be much higher than with a file system, and the setup
would take far longer to complete than a zonefs mount.

>> Another approach I considered is using FUSE, but went for a regular (albeit
>> simple) in-kernel approach due to performance concerns. While any difference in
>> performance for SMR HDDs would probably not be noticeable, performance would
>> likely be lower for upcoming NVMe zoned namespace devices compared to the
>> in-kernel approach.
>>
>> But granted, most of the arguments I can put forward for an in-kernel FS
>> solution vs a user shared library solution are mostly subjective. I think though
>> that having support directly provided by the kernel brings zoned block devices
>> into the "mainstream storage options" rather than having them perceived as
>> fringe solutions that need additional libraries to work correctly. Zoned block
>> devices are not going away and may in fact become more mainstream as
>> implementing higher capacities more and more depends on the sequential write
>> interface.
> 
> A file system like this would further cement in my mind that zoned block
> devices are not mainstream storage options.  I guess this part is
> highly subjective.  :)

Yes, it is subjective :) Many data centers, even large-scale ones, are already
switching to "all SMR" backend storage, relying on traditional block devices
(SSDs mostly) for active data. For these systems, SMR is a mainstream solution.

When saying "mainstream", I was referring more to the software needed to use
these drives rather than the drives as a solution. zonefs allows mapping the
zone sequential write constraint to a known concept: file append write. And in
this sense, I think we can consider zonefs as progress.
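
To make the "file append write" mapping concrete, here is a minimal Python
sketch. It is an illustration only, not code from the patch: the helper name
append_record is hypothetical, an ordinary temporary file stands in for a zone
file, and both the use of O_APPEND and the idea that the file size tracks the
zone write pointer are assumptions.

```python
import os
import tempfile


def append_record(path: str, data: bytes) -> int:
    """Append data to path with plain system calls.

    Returns the offset at which the data landed. Assumes a single writer:
    the size check and the write are not atomic with respect to each other.
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    try:
        # Assumption: the current file size plays the role of the write pointer.
        offset = os.fstat(fd).st_size
        written = os.write(fd, data)
        if written != len(data):
            raise OSError("short write")
        return offset
    finally:
        os.close(fd)


# Demonstration on a regular temporary file standing in for a zone file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name

first = append_record(demo_path, b"hello")
second = append_record(demo_path, b"world")

with open(demo_path, "rb") as f:
    contents = f.read()

os.unlink(demo_path)
```

The point of the sketch is that nothing in it is specific to zoned devices: the
application only ever writes at the end of the file, which is the access
pattern a sequential zone requires.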

Ideally, I would not bother with this at all and would point people to Btrfs
(work in progress) or XFS (next in the pipeline) for using an FS on zoned block
devices. But there is definitely a tendency for many distributed applications to
try to do without any file system at all (Ceph -> Bluestore is one example). And
as mentioned before, there are other use cases where a fully POSIX compliant
file system is not really necessary at all (LevelDB, RocksDB). zonefs fits in
the middle ground here, between removing the normal file system and going to a
raw block device. Bluestore has "bluefs" underneath RocksDB after all. One can
never really go directly to a raw block device, and in most cases some level of
abstraction (space management) is needed. That is where zonefs fits.

Best regards.

-- 
Damien Le Moal
Western Digital Research