On Mon, Feb 06, 2023 at 01:41:49PM +0000, Hans Holmberg wrote:
> Write amplification induced by garbage collection negatively impacts
> both the performance and the lifetime of storage devices.
>
> With zoned storage now standardized for SMR hard drives
> and flash (both NVMe and UFS) we have an interface that allows
> us to reduce this overhead by adapting file systems to do
> better data placement.

I'm also very interested in discussions related to data placement.

> Background
> ----------
>
> Zoned block devices enable the host to reduce the cost of
> reclaim/garbage collection/cleaning by exposing the media erase
> units as zones.
>
> By filling up zones with data from files that will
> have roughly the same life span, garbage collection I/O
> can be minimized, reducing write amplification:
> less disk I/O per user write.
>
> Reduced amounts of garbage collection I/O improve
> maximum user read and write throughput and tail latencies, see [1].
>
> Migrating out still-valid data to erase and reclaim unused
> capacity in e.g. NAND blocks has a significant performance
> cost. Unnecessarily moving data around also means that there
> will be more erase cycles per user write, reducing the
> lifetime of the media.
>
> Current state
> -------------
>
> To enable the performance benefits of zoned block devices,
> a file system needs to:
>
> 1) Comply with the write restrictions associated with the
>    zoned device model.
>
> 2) Make active choices when allocating file data into zones
>    to minimize GC.
>
> Out of the upstream file systems, btrfs and f2fs support
> the zoned block device model. F2fs supports active data placement
> by separating cold from hot data, which helps reduce GC,
> but there is room for improvement.
>
> There is still work to be done
> ------------------------------
>
> I've spent a fair amount of time benchmarking btrfs and f2fs
> on top of zoned block devices along with xfs, ext4 and other
> file systems using the conventional block interface,
> and at least for modern applications doing log-structured,
> flash-friendly writes, much can be improved.
>
> A good example of a flash-friendly workload is RocksDB [6],
> which both does append-only writes and has a good prediction model
> for the lifetime of its files (due to its LSM-tree-based data structures).
>
> For RocksDB workloads, the cost of garbage collection can be reduced
> by 3x if near-optimal data placement is done (at 80% capacity usage).
> This is based on comparing ZenFS [2], a zoned storage file system plugin
> for RocksDB, with f2fs, xfs, ext4 and btrfs.
>
> I see no good reason why Linux kernel file systems (at least f2fs & btrfs)
> could not play as nicely with these workloads as ZenFS does, by just
> allocating file data blocks in a better way.
>
> In addition to ZenFS we also have flex-alloc [5].
> There are probably more data placement schemes for zoned storage out there.
>
> I think we need to implement a scheme that is general-purpose enough
> for in-kernel file systems to cover a wide range of use cases and workloads.
>
> I brought this up at LPC last year [4], but we did not have much time
> for discussions.
>
> What is missing
> ---------------
>
> Data needs to be allocated to zones in a way that minimizes the need for
> reclaim. Best-effort placement decision making could be implemented to place
> files of similar lifetimes into the same zones.
>
> To do this, file systems would have to utilize some sort of hint to
> separate data into different lifetime buckets and map those to
> different zones.
>
> There is a user ABI for hints available - the write-life-time hint interface
> that was introduced for streams [3]. F2FS is the only user of this currently.
>
> BTRFS and other file systems with zoned support could make use of it too,
> but it is limited to four relative lifetime values, which I'm afraid would
> be too limiting when multiple users share a disk.
>
> Maybe the lifetime hints could be combined with the process ID to separate
> different workloads better, or maybe we need something else. F2FS supports
> cold/hot data separation based on file extension, which is another solution.
>
> This is the first thing I'd like to discuss.
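On the hint side, for reference, this is roughly what tagging files with
the existing write-lifetime hint ABI looks like from user space today.
A minimal sketch only: the open_with_hint() helper, the file names and the
LSM-tree mapping are made-up examples, and the fallback defines just mirror
the values in <linux/fcntl.h>:

/*
 * Per-file lifetime tagging via fcntl(F_SET_RW_HINT), the interface
 * introduced for streams [3].
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT		(1024 + 12)	/* from <linux/fcntl.h> */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT	2		/* from <linux/fcntl.h> */
#define RWH_WRITE_LIFE_EXTREME	5
#endif

static int open_with_hint(const char *path, uint64_t hint)
{
	int fd = open(path, O_CREAT | O_WRONLY, 0644);

	if (fd < 0)
		return -1;
	/* Attach an inode-level lifetime hint; f2fs acts on this today. */
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		perror("F_SET_RW_HINT");
	return fd;
}

int main(void)
{
	/* e.g. an LSM tree: WAL files die young, bottom-level SSTs live long */
	int wal = open_with_hint("wal.log", RWH_WRITE_LIFE_SHORT);
	int sst = open_with_hint("level6.sst", RWH_WRITE_LIFE_EXTREME);

	if (wal >= 0)
		close(wal);
	if (sst >= 0)
		close(sst);
	return 0;
}

Whatever scheme we end up with, it would be good if it stays about this
simple for applications to adopt.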
>
> The second thing I'd like to discuss is testing and benchmarking, which
> is probably even more important and something that should be put into
> place first.
>
> Testing/benchmarking
> --------------------
>
> I think any improvements must be measurable, preferably without having to
> run live production application workloads.
>
> Benchmarking and testing are generally hard to get right, and particularly
> hard when it comes to testing and benchmarking reclaim/garbage collection,
> so it would make sense to share the effort.
>
> We should be able to use fio to model a bunch of application workloads
> that would benefit from data placement (LSM-tree-based key-value stores,
> e.g. RocksDB and TerarkDB, stream processing apps like Apache Kafka, ...).
>
> Once we have a set of benchmarks that we collectively care about, I think we
> can work towards generic data placement methods with some level of
> confidence that they will actually work in practice.
>
> Creating a repository with a bunch of reclaim/GC stress tests and benchmarks
> would be beneficial not only for kernel file systems but also for user-space
> and distributed file systems such as Ceph.
>
> Thanks,
> Hans
>
> [1] https://www.usenix.org/system/files/atc21-bjorling.pdf
> [2] https://github.com/westerndigitalcorporation/zenfs
> [3] https://lwn.net/Articles/726477/
> [4] https://lpc.events/event/16/contributions/1231/
> [5] https://github.com/OpenMPDK/FlexAlloc
> [6] https://github.com/facebook/rocksdb
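To make the benchmarking discussion concrete: the simplest reclaim/GC
stress model I can think of is "fill the file system to a target
utilization, then keep churning a random subset of the files", roughly
like the sketch below. Every name, size and count in it is a placeholder,
not a proposed benchmark:

/*
 * Rough GC-stress sketch: phase 1 fills a directory with fixed-size
 * files, phase 2 repeatedly deletes a random file and rewrites it,
 * forcing the file system to reclaim space in steady state.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NR_FILES	128		/* placeholder: derive from capacity */
#define FILE_SIZE	(16UL << 20)	/* placeholder: 16 MiB per file */
#define NR_CHURNS	1024		/* placeholder: churn iterations */

static void write_file(const char *dir, int idx)
{
	char path[256], buf[1 << 16];
	size_t left = FILE_SIZE;
	int fd;

	snprintf(path, sizeof(path), "%s/f%04d", dir, idx);
	memset(buf, idx & 0xff, sizeof(buf));
	fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (fd < 0) {
		perror(path);
		exit(1);
	}
	while (left) {
		ssize_t ret = write(fd, buf,
				    left < sizeof(buf) ? left : sizeof(buf));

		if (ret <= 0) {
			perror("write");
			exit(1);
		}
		left -= ret;
	}
	fsync(fd);
	close(fd);
}

int main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : ".";
	char path[256];
	int i;

	srand(42);				/* fixed seed, reproducible churn */
	for (i = 0; i < NR_FILES; i++)		/* phase 1: fill */
		write_file(dir, i);
	for (i = 0; i < NR_CHURNS; i++) {	/* phase 2: steady-state churn */
		int victim = rand() % NR_FILES;

		snprintf(path, sizeof(path), "%s/f%04d", dir, victim);
		unlink(path);
		write_file(dir, victim);
	}
	return 0;
}

The same fill-plus-churn pattern should be expressible as fio job files;
the hard part is agreeing on the file size and lifetime mix that actually
represents the applications we care about. So +1 for a shared repository
of such models.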