On Wed, Jul 17, 2024 at 11:45:48AM GMT, Ryan Roberts wrote:
> On 17/07/2024 11:31, David Hildenbrand wrote:
> > On 17.07.24 09:12, Ryan Roberts wrote:
> >> Hi All,
> >>
> >> This series is an RFC that adds sysfs and kernel cmdline controls to configure
> >> the set of allowed large folio sizes that can be used when allocating
> >> file-memory for the page cache. As part of the control mechanism, it provides
> >> for a special-case "preferred folio size for executable mappings" marker.
> >>
> >> I'm trying to solve 2 separate problems with this series:
> >>
> >> 1. Reduce pressure in iTLB and improve performance on arm64: This is a modified
> >> approach for the change at [1]. Instead of hardcoding the preferred executable
> >> folio size into the arch, user space can now select it. This decouples the arch
> >> code and also makes the mechanism more generic; it can be bypassed (the default)
> >> or any folio size can be set. For my use case, 64K is preferred, but I've also
> >> heard from Willy of a use case where putting all text into 2M PMD-sized folios
> >> is preferred. This approach avoids the need for synchronous MADV_COLLAPSE (and
> >> therefore faulting in all text ahead of time) to achieve that.
> >>
> >> 2. Reduce memory fragmentation in systems under high memory pressure (e.g.
> >> Android): The theory goes that if all folios are 64K, then failure to allocate a
> >> 64K folio should become unlikely. But if the page cache is allocating lots of
> >> different orders, with most allocations having an order below 64K (as is the
> >> case today), then the ability to allocate 64K folios diminishes. By providing control
> >> over the allowed set of folio sizes, we can tune to avoid crucial 64K folio
> >> allocation failures. Additionally, I've heard (second hand) of the need to disable
> >> large folios in the page cache entirely due to latency concerns in some
> >> settings. These controls allow all of this without kernel changes.
> >>
> >> The value of (1) is clear and the performance improvements are documented in
> >> patch 2. I don't yet have any data demonstrating the theory for (2) since I
> >> can't reproduce the setup that Barry had at [2]. But my view is that by adding
> >> these controls we will enable the community to explore further, in the same way
> >> that the anon mTHP controls helped harden the understanding for anonymous
> >> memory.
> >>
> >> ---
> >
> > How would this interact with other requirements we get from the filesystem (for
> > example, because of the device) [1]?
> >
> > Assuming a device has a filesystem with a min order of X, but we disable anything
> > >= X, how would we combine that configuration/information?
>
> Currently order-0 is implicitly the "always-on" fallback order. My thinking was
> that with [1], the specified min order just becomes that "always-on" fallback order.
>
> Today:
>
> orders = file_orders_always() | BIT(0);
>
> Tomorrow:
>
> orders = (file_orders_always() & ~(BIT(min_order) - 1)) | BIT(min_order);
>
> That does mean that in this case, a user-disabled order could still be used. So
> the controls are really hints rather than definitive commands.

In the scenario where a min order is not enabled in hugepages-<size>kB/file_enabled,
will the user still be allowed to automatically mkfs/mount with blocksize=min_order,
and will sysfs reflect this? Or, since it's a hint, will it remain hidden but still
allow mkfs/mount to proceed?

> >
> >
> > [1]
> > https://lore.kernel.org/all/20240715094457.452836-2-kernel@xxxxxxxxxxxxxxxx/T/#u
> >
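
For my own understanding, below is a minimal userspace sketch of how I read the
"Today"/"Tomorrow" mask combination quoted above. The file_orders_always() stub,
the BIT() macro and the example orders (0, 4, 9) are stand-ins of mine for
illustration, not the actual code from the series:

/*
 * Sketch of combining user-enabled page cache folio orders with a
 * filesystem-imposed min order, per the snippet quoted above.
 */
#include <stdio.h>

#define BIT(n) (1UL << (n))

/* Stand-in: pretend sysfs has orders 0, 4 and 9 enabled
 * (4K, 64K, 2M with a 4K base page). */
static unsigned long file_orders_always(void)
{
	return BIT(0) | BIT(4) | BIT(9);
}

/* Today: user-enabled orders, with order-0 as the implicit
 * always-on fallback. */
static unsigned long orders_today(void)
{
	return file_orders_always() | BIT(0);
}

/* Tomorrow: clear every order below the filesystem's min order,
 * then force the min order on as the new always-on fallback.
 * Note a user-disabled min order can still end up being used. */
static unsigned long orders_tomorrow(unsigned int min_order)
{
	return (file_orders_always() & ~(BIT(min_order) - 1)) | BIT(min_order);
}

int main(void)
{
	printf("today:        0x%lx\n", orders_today());      /* 0x211 */
	printf("min_order=2:  0x%lx\n", orders_tomorrow(2));  /* 0x214: order 2 forced on */
	printf("min_order=4:  0x%lx\n", orders_tomorrow(4));  /* 0x210: order 0 dropped */
	return 0;
}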