Re: [PATCH v1 2/2] mm: mTHP stats for pagecache folio allocations

On 16.07.24 10:31, Ryan Roberts wrote:
On 13/07/2024 11:45, Ryan Roberts wrote:
On 13/07/2024 02:08, David Hildenbrand wrote:
On 12.07.24 14:22, Lance Yang wrote:
On Fri, Jul 12, 2024 at 11:00 AM Baolin Wang
<baolin.wang@xxxxxxxxxxxxxxxxx> wrote:



On 2024/7/11 15:29, Ryan Roberts wrote:
Expose 3 new mTHP stats for file (pagecache) folio allocations:

     /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_alloc
     /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback
     /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback_charge

This will provide some insight on the sizes of large folios being
allocated for file-backed memory, and how often allocation is failing.

All non-order-0 (and most order-0) folio allocations are currently done
through filemap_alloc_folio(), and folios are charged in a subsequent
call to filemap_add_folio(). So count file_fallback when allocation
fails in filemap_alloc_folio() and count file_alloc or
file_fallback_charge in filemap_add_folio(), based on whether charging
succeeded or not. There are some users of filemap_add_folio() that
allocate their own order-0 folio by other means, so we would not count
an allocation failure in this case, but we also don't care about order-0
allocations. This approach feels like it should be good enough and
doesn't require any (impractically large) refactoring.
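
Roughly, the counting ends up in those two helpers. The following is only a
sketch (the MTHP_STAT_FILE_* names and exact hook points are illustrative,
reusing the existing count_mthp_stat() helper; the NUMA/mempolicy and
workingset details are trimmed):

  struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
  {
          struct folio *folio = folio_alloc(gfp, order);

          /* Could not allocate a folio of this order. */
          if (!folio)
                  count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);

          return folio;
  }

  int filemap_add_folio(struct address_space *mapping, struct folio *folio,
                        pgoff_t index, gfp_t gfp)
  {
          int ret = mem_cgroup_charge(folio, NULL, gfp);

          if (ret) {
                  /* Allocation succeeded but the memcg charge failed. */
                  count_mthp_stat(folio_order(folio),
                                  MTHP_STAT_FILE_FALLBACK_CHARGE);
                  return ret;
          }

          /* Allocated and charged: count a successful pagecache alloc. */
          count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_ALLOC);

          /* Locking and xarray insertion as before. */
          __folio_set_locked(folio);
          ret = __filemap_add_folio(mapping, folio, index, gfp, NULL);
          if (unlikely(ret)) {
                  mem_cgroup_uncharge(folio);
                  __folio_clear_locked(folio);
          }
          return ret;
  }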

The existing mTHP stats interface is reused to provide consistency to
users. And because we are reusing the same interface, we can reuse the
same infrastructure on the kernel side. The one small wrinkle is that
the set of folio sizes supported by the pagecache is not identical to
those supported by anon and shmem; pagecache supports order-1, unlike
anon and shmem, and the max pagecache order may be less than PMD-size
(see arm64 with 64K base pages), again unlike anon and shmem. So we now
create a hugepages-*kB directory for the union of the sizes supported by
all 3 memory types and populate it with the relevant stats and controls.
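
On the sysfs side, that amounts to something like the sketch below when
populating the hugepages-*kB directories (THP_ORDERS_ALL_ANON and
thpsize_create() are the existing anon-side pieces; the file order bitmap
here is an illustrative assumption):

  static unsigned long thp_sysfs_orders(void)
  {
          /* Pagecache: orders 1..MAX_PAGECACHE_ORDER (order-1 included). */
          unsigned long file_orders = BIT(MAX_PAGECACHE_ORDER + 1) - 2;

          /* Union of the anon/shmem orders and the pagecache orders. */
          return THP_ORDERS_ALL_ANON | file_orders;
  }

  static void __init thp_sysfs_populate(struct kobject *parent)
  {
          unsigned long orders = thp_sysfs_orders();
          int order;

          /*
           * One hugepages-<size>kB directory per order in the union; each
           * directory only exposes the attributes that apply to that
           * order. Error handling elided for brevity.
           */
          for_each_set_bit(order, &orders, BITS_PER_LONG)
                  thpsize_create(order, parent);
  }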

Personally, I like this idea; it can help analyze the allocation of
large folios for the page cache.

However, I have a slight concern about the consistency of the interface.

For 64K, the field layout is:
├── hugepages-64kB
│   ├── enabled
│   ├── shmem_enabled
│   └── stats
│       ├── anon_fault_alloc
│       ├── anon_fault_fallback
│       ├── anon_fault_fallback_charge
│       ├── file_alloc
│       ├── file_fallback
│       ├── file_fallback_charge
│       ├── shmem_alloc
│       ├── shmem_fallback
│       ├── shmem_fallback_charge
│       ├── split
│       ├── split_deferred
│       ├── split_failed
│       ├── swpout
│       └── swpout_fallback

But for 8kB (pagecache only), you removed some fields (of course, I
understand why they are not supported).

├── hugepages-8kB
│   └── stats
│       ├── file_alloc
│       ├── file_fallback
│       └── file_fallback_charge

This might not be user-friendly for some user-space parsing tools, as
certain fields are missing from interfaces that otherwise follow the
same pattern. Of course, this might not be an issue if we have clear
documentation describing the differences here :)

Another possible approach is to maintain the same field layout for
consistency, but prohibit writing to the fields that are not supported by
the pagecache; any stats read from them would be 0.
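
For the entries that do not apply to a given size, that could be as simple
as the sketch below (the attribute handler names are purely illustrative):

  static ssize_t unsupported_stat_show(struct kobject *kobj,
                                       struct kobj_attribute *attr, char *buf)
  {
          /* The stat does not apply to this folio size: always read 0. */
          return sysfs_emit(buf, "0\n");
  }

  static ssize_t unsupported_ctrl_store(struct kobject *kobj,
                                        struct kobj_attribute *attr,
                                        const char *buf, size_t count)
  {
          /* The control does not apply to this folio size: refuse writes. */
          return -EINVAL;
  }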

I agree that maintaining a uniform field layout, especially at the stats
level, might be necessary ;)

Keeping a consistent interface could future-proof the design. It allows
for the possibility that features not currently supported for 8kB pages
might be enabled in the future.

I'll just note that, with shmem/file effectively being disabled for order > 11,
we'll also have entries there that are effectively unused.

Indeed, I mentioned that in the commit log :)

Well, I think it's more extreme than what you mentioned.

For example, shmem_enabled on arm64 with 64K pages is now effectively non-functional, just as it will be for other orders in the anon-shmem case when the order exceeds MAX_PAGECACHE_ORDER.



Good question how we want to deal with that (stats are easy, but what about when
we enable something? Maybe we should document that "enabled" is only effective
when supported).

The documentation already says "If enabling multiple hugepage sizes, the kernel
will select the most appropriate enabled size for a given allocation." for anon
THP (and I've added similar wording for my as-yet-unposted patch to add controls
for page cache folio sizes). So I think we could easily add dummy *enabled
controls for all sizes, that can be written to and read back consistently, but
the kernel just ignores them when deciding what size to use. It would also
simplify the code that populates the controls.
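
In other words, the policy side would just mask the per-size enabled bits
against what the pagecache can actually use, along the lines of this sketch
(the helper is hypothetical, mirroring the existing
thp_vma_allowable_orders()-style masking):

  static unsigned long file_allowable_orders(unsigned long enabled_orders)
  {
          /* Orders 0..MAX_PAGECACHE_ORDER are usable for pagecache folios. */
          unsigned long supported = BIT(MAX_PAGECACHE_ORDER + 1) - 1;

          /*
           * Dummy "enabled" bits for unsupported sizes can be written via
           * sysfs, but they never survive this mask, so they have no
           * effect on the chosen allocation order.
           */
          return enabled_orders & supported;
  }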

Personally though, I'm not convinced of the value of trying to make the controls
for every size look identical. What's the real value to the user to pretend that
they can select a size that they cannot? What happens when we inevitably want to
add some new control in future which only applies to select sizes and there is
no good way to fake it for the other sizes? Why can't user space just be
expected to rely on the existence of the files rather than on the existence of
the directories?

As always, I'll go with the majority, but just wanted to register my opinion.

Should I assume from the lack of reply on this that everyone else is in favour
of adding dummy controls so that all sizes have the same set of controls? If I
don't hear anything further, I'll post v2 with dummy controls today or tomorrow.

Sorry, busy with other stuff.

Indicating only what really exists sounds cleaner. But I wonder how we would want to handle, in general, orders that are effectively non-existent?

--
Cheers,

David / dhildenb




