Re: [LSF/MM/BPF TOPIC]: File system data placement for zoned block devices

On Tue, Feb 7, 2023 at 6:46 PM Boris Burkov <boris@xxxxxx> wrote:
>
> On Tue, Feb 07, 2023 at 01:31:44PM +0100, Hans Holmberg wrote:
> > On Mon, Feb 6, 2023 at 3:24 PM Johannes Thumshirn
> > <Johannes.Thumshirn@xxxxxxx> wrote:
> > >
> > > On 06.02.23 14:41, Hans Holmberg wrote:
> > > > Of the upstream file systems, btrfs and f2fs support
> > > > the zoned block device model. F2fs supports active data placement
> > > > by separating cold from hot data, which helps reduce gc,
> > > > but there is room for improvement.
> > >
> > > FYI, there's a patchset [1] from Boris for btrfs which uses different
> > > size classes to further parallelize placement. As of now it leaves out
> > > ZNS drives, as this can clash with the MOZ/MAZ limits but once active
> > > zone tracking is fully bug free^TM we should look into using these
> > > allocator hints for ZNS as well.
> > >
> >
> > That looks like a great start!
> >
> > Via that patch series I also found Josef's fsperf repo [1], which is
> > exactly what I have been looking for: a set of common tests for file
> > system performance. I hope that it can be extended with longer-running
> > tests doing several disk overwrites with application-like workloads.
>
> It should be relatively straightforward to add more tests to fsperf and
> we are happy to take new workloads! Also, feel free to shoot me any
> questions you run into while working on it and I'm happy to help.

Great, thanks!
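
For the longer-running overwrite tests I have in mind, something like
the following fio job could be a starting point (an untested sketch;
the mount point, sizes and loop count are placeholders to be tuned per
workload):

    ; rewrite the same working set several times to exercise reclaim
    [global]
    directory=/mnt/test        ; placeholder mount point
    ioengine=io_uring
    direct=1
    bs=1M

    [overwrite]
    rw=randwrite
    size=32G                   ; placeholder working-set size
    loops=4                    ; overwrite the working set four times
    fsync_on_close=1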

>
> >
> > > The hot/cold data can be a 2nd placement hint, of course, not just the
> > > different size classes of an extent.
> >
> > Yes. I'll dig into the patches and see if I can figure out how that
> > could be done.
>
> FWIW, I was working on reducing fragmentation/streamlining reclaim for
> non zoned btrfs. I have another patch set that I am still working on
> which attempts to use a working set concept to make placement
> lifetime/lifecycle a bigger part of the btrfs allocator.
>
> That patch set tries to make btrfs write faster in parallel, which may
> run counter to what you are going for; I'm not sure. Also, I didn't
> take advantage of the lifetime hints because I wanted it to help in the
> general case, but that could be an interesting direction too!

I'll need to dig into your patch set and look deeper into the btrfs
allocator code to know for sure, but reducing fragmentation is great
for zoned storage in general.

Filling up zones with data from a single file is the easiest way to
reduce write amplification, and it is optimal from a reclaim
perspective: entire zones can be reclaimed as soon as the file is
deleted.

This works great for LSM-tree-based workloads like RocksDB and should
work well for other applications using copy-on-write data structures
with configurable file sizes (like Apache Kafka [1], which uses 1 GiB
log segments by default).
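
To size its files to whole zones, an application first needs to know
the zone size. A minimal sketch (untested, error handling trimmed) of
querying it via the BLKGETZONESZ ioctl:

    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/blkzoned.h>

    int main(int argc, char **argv)
    {
        uint32_t zone_sectors = 0;
        int fd = open(argv[1], O_RDONLY); /* e.g. /dev/nvme0n1 */

        if (fd < 0 || ioctl(fd, BLKGETZONESZ, &zone_sectors) < 0) {
            perror("BLKGETZONESZ");
            return 1;
        }
        /* the zone size is reported in 512-byte sectors */
        printf("zone size: %llu bytes\n",
               (unsigned long long)zone_sectors * 512);
        close(fd);
        return 0;
    }

The segment size can then be set to a multiple of that value, e.g. via
Kafka's segment.bytes or RocksDB's target_file_size_base.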

When data from several files needs to be co-located in the same zone,
things get more complicated: we then have to match up data from files
that have similar life spans.

If the user can tell us about the expected data lifetime via a hint,
that is great. If the file system does not have that information, some
other heuristic is needed, such as assuming that data written by
different processes or users/groups has different life spans. A more
advanced scheme, SepBIT [2], has been proposed for block storage and
may be applicable to file system data as well.
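
For the per-file hint case, there is already the write lifetime hint
interface in the kernel. A minimal sketch (untested; assumes kernel and
glibc headers that expose F_SET_RW_HINT, available since Linux 4.13;
the file name is just an example) of tagging a file as short-lived:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* hint that data written to this file will be short-lived */
        uint64_t hint = RWH_WRITE_LIFE_SHORT;
        int fd = open("segment.log", O_CREAT | O_WRONLY, 0644);

        if (fd < 0 || fcntl(fd, F_SET_RW_HINT, &hint) < 0) {
            perror("F_SET_RW_HINT");
            return 1;
        }
        /* subsequent writes to fd carry the lifetime hint */
        close(fd);
        return 0;
    }

How each file system maps these hints to zones varies, but it gives
the allocator something to work with.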

Thanks,
Hans

[1] https://kafka.apache.org/documentation/#topicconfigs_segment.bytes
[2] http://adslab.cse.cuhk.edu.hk/pubs/tech_sepbit.pdf

> If you're curious about that work, the current state of the patches is
> in this branch:
> https://github.com/kdave/btrfs-devel/compare/misc-next...boryas:linux:bg-ws
> (Johannes, those are the patches I worked on after you noticed the
> allocator being slow with many disks.)
>
> Boris
>
> >
> > Cheers,
> > Hans
> >
> > [1] https://github.com/josefbacik/fsperf


