> On Jan 5, 2024, at 11:27 AM, Liam Howlett <liam.howlett@xxxxxxxxxx> wrote: > > * Chuck Lever III <chuck.lever@xxxxxxxxxx> [240104 14:33]: >> >> >>> On Sep 12, 2023, at 12:01 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: >>> >>> On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote: >>>>> Well that's the problem. Since I can't run the reproducer, there's >>>>> nothing I can do to troubleshoot the problem myself. >>>> >>>> We dug more into the perf and other profiling data from 0Day server >>>> running this case, and it seems that the new simple_offset_add() >>>> called by shmem_mknod() brings extra cost related with slab, >>>> specifically the 'radix_tree_node', which cause the regression. >>>> >>>> Here is some slabinfo diff for commit a2e459555c5f and its parent: >>>> >>>> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6 >>>> ---------------- --------------------------- >>>> >>>> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs >>>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs >>>> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs >>>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs >>> >>> I can't find the benchmark source, but my suspicion is that this >>> creates and deletes a lot of files in a directory. The 'stable >>> directory offsets' series uses xa_alloc_cyclic(), so we'll end up >>> with a very sparse radix tree. ie it'll look something like this: >>> >>> 0 - "." >>> 1 - ".." >>> 6 - "d" >>> 27 - "y" >>> 4000 - "fzz" >>> 65537 - "czzz" >>> 643289767 - "bzzzzzz" >>> >>> (i didn't work out the names precisely here, but this is approximately >>> what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete >>> almost all of them) >>> >>> The radix tree does not handle this well. It'll allocate one node for: >>> >>> entries 0-63 (covers the first 4 entries) >>> entries 0-4095 >>> entries 3968-4031 (the first 5) >>> entries 0-262143 >>> entries 65536-69631 >>> entries 65536-65599 (the first 6) >>> entries 0-16777215 >>> entries 0-1073741823 >>> entries 637534208-654311423 >>> entries 643039232-643301375 >>> entries 643289088-643293183 >>> entries 643289728-643289791 (all 7) >>> >>> That ends up being 12 nodes (you get 7 nodes per page) to store 7 >>> pointers. Admittedly to get here, you have to do 643289765 creations >>> and nearly as many deletions, so are we going to see it in a >>> non-benchmark situation? >>> >>> The maple tree is more resilient against this kind of shenanigan, but >>> we're not there in terms of supporting the kind of allocation you >>> want. For this kind of allocation pattern, you'd get all 7 pointers >>> in a single 256-byte node. >> >> Hello Matthew, it's been a couple of kernel releases, so >> following up. >> >> Is Maple tree ready for libfs to use it for managing directory >> offsets? > > The feature you are looking for is dense nodes. It will allow for > a compact tree when you have a number of single indexes mapping to > entries (ideal for many ranges of 1). > > I'm actively working on dense nodes, which will yield 31 entries per > node when they are single index mappings. I'm hoping to have it > completed in the next few weeks and start beating it up with tests > before pushing it out. > >> >> Should we just go for broke and convert libfs from xarray to >> Maple tree now? > > We are trying to keep the API close for both the xarray and the maple > tree, so if you do the conversion to one then switching shouldn't be > much work. I'd try the maple tree to see if the performance is > acceptable today (I may be biased), but I don't know how big of an > effort this conversion would entail. > > The maple tree will compress the NULL indexes to a single entry of NULL. > My main concern is the density of information and the number of > allocations the tree will do to keep up with the workload - both will > improve with the dense nodes feature. > > If you convert to maple tree, you will get the update for free later as > the node type the tree chooses will be transparent to users. > > If you need tagging then you should use the xarray as I haven't started > that feature yet - but I don't think you need that? I don't recall using xarray tags for directory offset mapping. > I also noticed that Matthew mentioned xa_alloc_cyclic() as the potential > call into the xarray. The maple tree counterpart isn't used much today > and may need to be optimised. If you can verify what this test does, we > could produce a test case for the maple tree test suite and optimise if > necessary. > > Let us know if you have any other questions or need some pointers on how > to get started with a conversion. Sounds like conversion is worth starting on, at least. I'll try to clear some time to work on it. -- Chuck Lever