ext4: add free-space extent based allocator This patchset introduces a new multiblock allocator for the Ext4 file system based on an enhanced version of the free space management mechanism originally proposed by Kadekodi S. and Jain S., in their paper, "Taking Linux Filesystems to the Space Age: Space Maps in Ext4" [1]. High Level Design ================= This allocator maintains the free space information about the file system in per-flexible block group level trees. These trees are constructed from block bitmaps stored on disk. Therefore, this feature does not require any on-disk format change. For every flexible block group, we maintain individual freespace nodes in two trees, one sorted by flex-bg offset and other sorted by length. This gives us two benefits: i) In the allocation path, our search time complexity is O(Log(Number of freespace regions in the flex-bg)). ii) In free path, in the same time complexity we can merge the adjacent nodes thereby reducing the size of the tree efficiently. Along with flexible block group level trees, we also introduce a top level meta-tree which consists of individual flex-bg trees. This tree is sorted by length of largest extents found in flex-bgs. The key advantage that this tree gives us is this: During an allocation request, the allocator is now able to consult this tree and directly (in O(Log(Number of Flex BGs)) jump to a flexible block group which _definitely_ has at least one (the largest) extent that can satisfy this request. If no flexible block group exists which can satisfy this request, the allocator can now immediately drop the CR level. In order to preseve the parallel allocation / free performance, the allocator only *tries* to consult this tree. It does so by calling read_trylock() function and if the meta-tree is busy, the allocator continues its linear search until it is able to grab a read-lock on this tree. Memory Footprint ================ In a less fragmented file system, the memory footprint of the new allocator is significantly lower than buddy bitmaps. Memory footprint of the freespace tree allocator can be checked by running "cat /sys/fs/ext4/<dev>/frsp_tree_usage". For an almost newly created 100G disk, the total usage is only ~10 KB. However, as the fragmentation level increases, the memory footprint of this allocator may increase. The memory usage of the freespace tree allocator can be controlled using sysfs tunable "mb_frsp_max_mem". Once the memory threshold is reached, the allocator starts evicting the freespace trees in the LRU fashion from memory. However, we don't remove tree's entry from the meta-tree. This allows the allocator to efficiently reconstruct only relevant trees from on-disk bitmaps under high memory pressure. As we show in the evaluation section, freespace tree allocator still manages to outperform buddy allocator in memory crunch situation. The default value of mb_frsp_max_mem is max(memory needed for buddy, maximum memory needed for one tree in the worst case). Accounting for max memory needed for one tree allows us to keep at least one tree in memory even in the worst case. This avoids thrashing. Caching Tree Metadata ===================== As noted in our experiments, we find caching tree metadata (the largest available extent in a tree) in the meta-tree significantly boosts allocation performance. However, if the allocator waits for the cache to fill up before starting to serve allocation requests, that may severely degrade allocation performance on large disks. Thus, it is important to tune the tree caching behavior according to the underlying block device. This patchset provides four cache aggression levels. Cache aggression level can be specified as a mount time parameter "frsp_cache_aggression". Here is the meaning of these different levels: |-------+------------------------------------------------------------------| | Level | Meaning | |-------+------------------------------------------------------------------| | 0 | Try to avoid caching as much as possible. In this mode | | | the allocator tries hard to serve the request from the already | | | cached trees. | |-------+------------------------------------------------------------------| | 1 | Avoid caching at CR=0. Otherwise, cache trees on every other | | | allocation request. (default) | |-------+------------------------------------------------------------------| | 2 | Cache trees on every other allocation request. | |-------+------------------------------------------------------------------| | 3 | Aggressive caching. In this mode the allocator aggressively | | | caches uncached trees, even if the request can be fulfilled | | | by one of the available trees. Using this mode, the allocator | | | ends up caching trees quickly and thereby is able to make better | | | allocation decisions sooner. | |-------+------------------------------------------------------------------| Evaluation ========== This feature did not introduce any regressions in Ext4 XFStests in quick group. We created a _highly_ fragmented 60T disk with over 490K block groups and over 30K flexible block groups. We tested the write performance of the first small write (10MB) and a larger second write (100MB) using both the buddy allocator and freespace tree allocator. We turned memory limit off for the first four free space tree configurations. First Write Performance ----------------------- |--------------------------------------+---------+---------+-----------+--------| | Allocator | RAM | # trees | Perf | Time | | | usage | cached | | | |--------------------------------------+---------+---------+-----------+--------| | Buddy Allocator (v5.7) | - | - | 6.8 KB/s | 25m51s | |--------------------------------------+---------+---------+-----------+--------| | With --prefetch_block_bitmap | - | - | 28.1 KB/s | 6m14s | |--------------------------------------+---------+---------+-----------+--------| | Free space tree allocator at level 0 | 43.5 MB | 201 | 2.6 MB/s | 0m8s | |--------------------------------------+---------+---------+-----------+--------| | Free space tree allocator at level 1 | 193 MB | 874 | 1.5 MB/s | 0m47s | |--------------------------------------+---------+---------+-----------+--------| | Free space tree allocator at level 2 | 3.6 GB | 16476 | 21.4 KB/s | 8m14s | |--------------------------------------+---------+---------+-----------+--------| | Free space tree allocator at level 3 | 6.8 GB | 30720 | 9.1 KB/s | 19m22s | |--------------------------------------+---------+---------+-----------+--------| | Free space tree allocator at level 3 | 1023 MB | 30720 | 9.3 KB/s | 18m53s | | ( With 1G Limit) | | | | | |--------------------------------------+---------+---------+-----------+--------| Second Write Performance ------------------------ |--------------------------------------+---------+---------+-----------+-------| | Allocator | RAM | # trees | Perf | Time | | | usage | cached | | | |--------------------------------------+---------+---------+-----------+-------| | Buddy Allocator (v5.7) | - | - | 499 KB/s | 3m30s | |--------------------------------------+---------+---------+-----------+-------| | With --prefetch_block_bitmap | - | - | 185 KB/s | 9m26s | |--------------------------------------+---------+---------+-----------+-------| | Free space tree allocator at level 0 | 48.7 MB | 226 | 48.2 MB/s | 6s | |--------------------------------------+---------+---------+-----------+-------| | Free space tree allocator at level 1 | 221 MB | 1007 | 26.8 MB/s | 8s | |--------------------------------------+---------+---------+-----------+-------| | Free space tree allocator at level 2 | 6.8 GB | 30720 | 178 KB/s | 9m54s | |--------------------------------------+---------+---------+-----------+-------| | Free space tree allocator at level 3 | 6.8 GB | 30720 | 79 MB/s | 5s | |--------------------------------------+---------+---------+-----------+-------| | Free space tree allocator at level 3 | 1023 MB | 30720 | 77.1 MB/s | 7s | | ( With 1G Limit) | | | | | |--------------------------------------+---------+---------+-----------+-------| Verified that parallel write performance on a newly created disk is not very different for buddy allocator and freespace tree allocator. This patchset is applied on top of block bitmap prefetch patches. Signed-off-by: Yuexin Li <lyx1209@xxxxxxxxx> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx> [1] https://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf Changes Since V1: - Rebased the patchset on top of ext4 dev branch - Converted the patches to RFC Harshad Shirwadkar (8): ext4: add handling for extended mount options ext4: rename ext4_mb_load_buddy to ext4_mb_load_allocator ext4: add prefetching support for freespace trees ext4: add freespace tree optimizations ext4: add memory usage tracker for freespace trees ext4: add LRU eviction for free space trees ext4: add tracepoints for free space trees ext4: add freespace trees documentation in code Yuexin Li (1): ext4: add freespace tree allocator fs/ext4/ext4.h | 117 +++ fs/ext4/mballoc.c | 1541 +++++++++++++++++++++++++++++++++-- fs/ext4/mballoc.h | 67 +- fs/ext4/resize.c | 8 + fs/ext4/super.c | 60 +- fs/ext4/sysfs.c | 11 + include/trace/events/ext4.h | 112 +++ 7 files changed, 1821 insertions(+), 95 deletions(-) -- 2.28.0.297.g1956fa8f8d-goog