On 13/07/2024 11:45, Ryan Roberts wrote: > On 13/07/2024 02:08, David Hildenbrand wrote: >> On 12.07.24 14:22, Lance Yang wrote: >>> On Fri, Jul 12, 2024 at 11:00 AM Baolin Wang >>> <baolin.wang@xxxxxxxxxxxxxxxxx> wrote: >>>> >>>> >>>> >>>> On 2024/7/11 15:29, Ryan Roberts wrote: >>>>> Expose 3 new mTHP stats for file (pagecache) folio allocations: >>>>> >>>>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_alloc >>>>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback >>>>> >>>>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback_charge >>>>> >>>>> This will provide some insight on the sizes of large folios being >>>>> allocated for file-backed memory, and how often allocation is failing. >>>>> >>>>> All non-order-0 (and most order-0) folio allocations are currently done >>>>> through filemap_alloc_folio(), and folios are charged in a subsequent >>>>> call to filemap_add_folio(). So count file_fallback when allocation >>>>> fails in filemap_alloc_folio() and count file_alloc or >>>>> file_fallback_charge in filemap_add_folio(), based on whether charging >>>>> succeeded or not. There are some users of filemap_add_folio() that >>>>> allocate their own order-0 folio by other means, so we would not count >>>>> an allocation failure in this case, but we also don't care about order-0 >>>>> allocations. This approach feels like it should be good enough and >>>>> doesn't require any (impractically large) refactoring. >>>>> >>>>> The existing mTHP stats interface is reused to provide consistency to >>>>> users. And because we are reusing the same interface, we can reuse the >>>>> same infrastructure on the kernel side. The one small wrinkle is that >>>>> the set of folio sizes supported by the pagecache are not identical to >>>>> those supported by anon and shmem; pagecache supports order-1, unlike >>>>> anon and shmem, and the max pagecache order may be less than PMD-size >>>>> (see arm64 with 64K base pages), again unlike anon and shmem. So we now >>>>> create a hugepages-*kB directory for the union of the sizes supported by >>>>> all 3 memory types and populate it with the relevant stats and controls. >>>> >>>> Personally, I like the idea that can help analyze the allocation of >>>> large folios for the page cache. >>>> >>>> However, I have a slight concern about the consistency of the interface. >>>> >>>> For 64K, the fields layout: >>>> ├── hugepages-64kB >>>> │ ├── enabled >>>> │ ├── shmem_enabled >>>> │ └── stats >>>> │ ├── anon_fault_alloc >>>> │ ├── anon_fault_fallback >>>> │ ├── anon_fault_fallback_charge >>>> │ ├── file_alloc >>>> │ ├── file_fallback >>>> │ ├── file_fallback_charge >>>> │ ├── shmem_alloc >>>> │ ├── shmem_fallback >>>> │ ├── shmem_fallback_charge >>>> │ ├── split >>>> │ ├── split_deferred >>>> │ ├── split_failed >>>> │ ├── swpout >>>> │ └── swpout_fallback >>>> >>>> But for 8K (for pagecache), you removed some fields (of course, I >>>> understand why they are not supported). >>>> >>>> ├── hugepages-8kB >>>> │ └── stats >>>> │ ├── file_alloc >>>> │ ├── file_fallback >>>> │ └── file_fallback_charge >>>> >>>> This might not be user-friendly for some user-space parsing tools, as >>>> they lack certain fields for the same pattern interfaces. Of course, >>>> this might not be an issue if we have clear documentation describing the >>>> differences here:) >>>> >>>> Another possible approach is to maintain the same field layout to keep >>>> consistent, but prohibit writing to the fields that are not supported by >>>> the pagecache, and any stats read from them would be 0. >>> >>> I agree that maintaining a uniform field layout, especially at the stats >>> level, might be necessary ;) >>> >>> Keeping a consistent interface could future-proof the design. It allows >>> for the possibility that features not currently supported for 8kB pages >>> might be enabled in the future. >> >> I'll just note that, with shmem/file effectively being disabled for order > 11, >> we'll also have entries there that are effectively unused. > > Indeed, I mentioned that in the commit log :) > >> >> Good question how we want to deal with that (stats are easy, but what about when >> we enable something? Maybe we should document that "enabled" is only effective >> when supported). > > The documentation already says "If enabling multiple hugepage sizes, the kernel > will select the most appropriate enabled size for a given allocation." for anon > THP (and I've added similar wording for my as-yet-unposted patch to add controls > for page cache folio sizes). So I think we could easily add dummy *enabled > controls for all sizes, that can be written to and read back consistently, but > the kernel just ignores them when deciding what size to use. It would also > simplify the code that populates the controls. > > Personally though, I'm not convinced of the value of trying to make the controls > for every size look identical. What's the real value to the user to pretend that > they can select a size that they cannot? What happens when we inevitably want to > add some new control in future which only applies to select sizes and there is > no good way to fake it for the other sizes? Why can't user space just be > expected to rely on the existance of the files rather than on the existance of > the directories? > > As always, I'll go with the majority, but just wanted to register my opinion. Should I assume from the lack of reply on this that everyone else is in favour of adding dummy controls so that all sizes have the same set of controls? If I don't hear anything further, I'll post v2 with dummry controls today or tomorrow. > > Thanks, > Ryan > >> >> Hmmmmm >> >