On Mon, Feb 06, 2023 at 01:41:49PM +0000, Hans Holmberg wrote:
> Write amplification induced by garbage collection negatively impacts
> both the performance and the lifetime of storage devices.
>
> With zoned storage now standardized for SMR hard drives
> and flash (both NVMe and UFS) we have an interface that allows
> us to reduce this overhead by adapting file systems to do
> better data placement.

I'm also very interested in discussions related to data placement.

> Background
> ----------
>
> Zoned block devices enable the host to reduce the cost of
> reclaim/garbage collection/cleaning by exposing the media erase
> units as zones.
>
> By filling up zones with data from files that will
> have roughly the same life span, garbage collection I/O
> can be minimized, reducing write amplification:
> less disk I/O per user write.
>
> Reduced amounts of garbage collection I/O improve
> maximum user read and write throughput and tail latencies, see [1].
>
> Migrating out still-valid data to erase and reclaim unused
> capacity in e.g. NAND blocks has a significant performance
> cost. Unnecessarily moving data around also means that there
> will be more erase cycles per user write, reducing the
> lifetime of the media.
>
> Current state
> -------------
>
> To enable the performance benefits of zoned block devices,
> a file system needs to:
>
> 1) Comply with the write restrictions associated with the
>    zoned device model.
>
> 2) Make active choices when allocating file data into zones
>    to minimize GC.
>
> Out of the upstream file systems, btrfs and f2fs support
> the zoned block device model. F2fs supports active data placement
> by separating cold from hot data, which helps reduce GC,
> but there is room for improvement.
>
> There is still work to be done
> ------------------------------
>
> I've spent a fair amount of time benchmarking btrfs and f2fs
> on top of zoned block devices along with xfs, ext4 and other
> file systems using the conventional block interface,
> and at least for modern applications doing log-structured,
> flash-friendly writes, much can be improved.
>
> A good example of a flash-friendly workload is RocksDB [6],
> which both does append-only writes and has a good prediction model
> for the lifetime of its files (due to its LSM-tree-based data structures).
>
> For RocksDB workloads, the cost of garbage collection can be reduced
> by 3x if near-optimal data placement is done (at 80% capacity usage).
> This is based on comparing ZenFS [2], a zoned storage file system plugin
> for RocksDB, with f2fs, xfs, ext4 and btrfs.
>
> I see no good reason why Linux kernel file systems (at least f2fs & btrfs)
> could not play as nicely with these workloads as ZenFS does, by just
> allocating file data blocks in a better way.
>
> In addition to ZenFS we also have flex-alloc [5].
> There are probably more data placement schemes for zoned storage out there.
>
> I think we need to implement a scheme that is general-purpose enough
> for in-kernel file systems to cover a wide range of use cases and workloads.
>
> I brought this up at LPC last year [4], but we did not have much time
> for discussions.
>
> What is missing
> ---------------
>
> Data needs to be allocated to zones in a way that minimizes the need for
> reclaim. Best-effort placement decision making could be implemented to place
> files of similar lifetimes into the same zones.
>
> To do this, file systems would have to utilize some sort of hint to
> separate data into different lifetime buckets and map those to
> different zones.
>
> There is a user ABI for hints available - the write-life-time hint interface
> that was introduced for streams [3]. F2FS is the only user of this currently.
>
> BTRFS and other file systems with zoned support could make use of it too,
> but it is limited to four relative lifetime values, which I'm afraid would
> be too limiting when multiple users share a disk.
>
> Maybe the lifetime hints could be combined with the process ID to separate
> different workloads better, or maybe we need something else. F2FS supports
> cold/hot data separation based on file extension, which is another solution.
>
> This is the first thing I'd like to discuss.
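On the hint side, for reference, this is roughly what tagging files with
the existing write-lifetime hint ABI looks like from user space today.
A minimal sketch only: the open_with_hint() helper, the file names and the
LSM-tree mapping are made-up examples, and the fallback defines just mirror
the values in <linux/fcntl.h>:

/*
 * Per-file lifetime tagging via fcntl(F_SET_RW_HINT), the interface
 * introduced for streams [3].
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT		(1024 + 12)	/* from <linux/fcntl.h> */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT	2		/* from <linux/fcntl.h> */
#define RWH_WRITE_LIFE_EXTREME	5
#endif

static int open_with_hint(const char *path, uint64_t hint)
{
	int fd = open(path, O_CREAT | O_WRONLY, 0644);

	if (fd < 0)
		return -1;
	/* Attach an inode-level lifetime hint; f2fs acts on this today. */
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		perror("F_SET_RW_HINT");
	return fd;
}

int main(void)
{
	/* e.g. an LSM tree: WAL files die young, bottom-level SSTs live long */
	int wal = open_with_hint("wal.log", RWH_WRITE_LIFE_SHORT);
	int sst = open_with_hint("level6.sst", RWH_WRITE_LIFE_EXTREME);

	if (wal >= 0)
		close(wal);
	if (sst >= 0)
		close(sst);
	return 0;
}

Whatever scheme we end up with, it would be good if it stays about this
simple for applications to adopt.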
>
> The second thing I'd like to discuss is testing and benchmarking, which
> is probably even more important and something that should be put into
> place first.
>
> Testing/benchmarking
> --------------------
>
> I think any improvements must be measurable, preferably without having to
> run live production application workloads.
>
> Benchmarking and testing are generally hard to get right, and particularly
> hard when it comes to testing and benchmarking reclaim/garbage collection,
> so it would make sense to share the effort.
>
> We should be able to use fio to model a bunch of application workloads
> that would benefit from data placement (LSM-tree-based key-value stores,
> e.g. RocksDB and TerarkDB, stream processing apps like Apache Kafka, ...).
>
> Once we have a set of benchmarks that we collectively care about, I think we
> can work towards generic data placement methods with some level of
> confidence that they will actually work in practice.
>
> Creating a repository with a bunch of reclaim/GC stress tests and benchmarks
> would be beneficial not only for kernel file systems but also for user-space
> and distributed file systems such as Ceph.
>
> Thanks,
> Hans
>
> [1] https://www.usenix.org/system/files/atc21-bjorling.pdf
> [2] https://github.com/westerndigitalcorporation/zenfs
> [3] https://lwn.net/Articles/726477/
> [4] https://lpc.events/event/16/contributions/1231/
> [5] https://github.com/OpenMPDK/FlexAlloc
> [6] https://github.com/facebook/rocksdb
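To make the benchmarking discussion concrete: the simplest reclaim/GC
stress model I can think of is "fill the file system to a target
utilization, then keep churning a random subset of the files", roughly
like the sketch below. Every name, size and count in it is a placeholder,
not a proposed benchmark:

/*
 * Rough GC-stress sketch: phase 1 fills a directory with fixed-size
 * files, phase 2 repeatedly deletes a random file and rewrites it,
 * forcing the file system to reclaim space in steady state.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NR_FILES	128		/* placeholder: derive from capacity */
#define FILE_SIZE	(16UL << 20)	/* placeholder: 16 MiB per file */
#define NR_CHURNS	1024		/* placeholder: churn iterations */

static void write_file(const char *dir, int idx)
{
	char path[256], buf[1 << 16];
	size_t left = FILE_SIZE;
	int fd;

	snprintf(path, sizeof(path), "%s/f%04d", dir, idx);
	memset(buf, idx & 0xff, sizeof(buf));
	fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (fd < 0) {
		perror(path);
		exit(1);
	}
	while (left) {
		ssize_t ret = write(fd, buf,
				    left < sizeof(buf) ? left : sizeof(buf));

		if (ret <= 0) {
			perror("write");
			exit(1);
		}
		left -= ret;
	}
	fsync(fd);
	close(fd);
}

int main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : ".";
	char path[256];
	int i;

	srand(42);				/* fixed seed, reproducible churn */
	for (i = 0; i < NR_FILES; i++)		/* phase 1: fill */
		write_file(dir, i);
	for (i = 0; i < NR_CHURNS; i++) {	/* phase 2: steady-state churn */
		int victim = rand() % NR_FILES;

		snprintf(path, sizeof(path), "%s/f%04d", dir, victim);
		unlink(path);
		write_file(dir, victim);
	}
	return 0;
}

The same fill-plus-churn pattern should be expressible as fio job files;
the hard part is agreeing on the file size and lifetime mix that actually
represents the applications we care about. So +1 for a shared repository
of such models.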