On 23.10.24 10:04, Baolin Wang wrote:
On 2024/10/22 23:31, David Hildenbrand wrote:
On 22.10.24 05:41, Baolin Wang wrote:
On 2024/10/21 21:34, Daniel Gomez wrote:
On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
On 2024/10/17 19:26, Kirill A. Shutemov wrote:
On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
+ Kirill
On 2024/10/16 22:06, Matthew Wilcox wrote:
On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
Considering that tmpfs already has the 'huge=' option to control THP
allocation, it is necessary to maintain compatibility with the 'huge='
option, as well as to consider the 'deny' and 'force' options controlled
by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
No, it's not. No other filesystem honours these settings. tmpfs would
not have had these settings if it were written today. It should simply
ignore them, the way that NFS ignores the "intr" mount option now that
we have a better solution to the original problem.
To reiterate my position:

 - When using tmpfs as a filesystem, it should behave like other
   filesystems.
 - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
   behave like anonymous memory.
I do agree with your point to some extent, but the 'huge=' option has
existed for nearly 8 years, and huge orders based on write size may not
achieve the performance of PMD-sized THP in some scenarios, such as when
the write length is consistently 4K. So I am still concerned that
ignoring the 'huge=' option could lead to compatibility issues.
Yeah, I don't think we are there yet to ignore the mount option.
OK.
Maybe we need a new generic interface to request the semantics tmpfs
has with huge= at a per-inode level on any fs. Like a set of FADV_*
handles to make the kernel allocate PMD-sized folios on any allocation,
or on allocations within i_size. I think this behaviour is useful
beyond tmpfs.

Then the huge= implementation for tmpfs can be re-defined to set these
per-inode FADV_ flags by default. This way we can keep tmpfs compatible
with current deployments and less special compared to the rest of the
filesystems on the kernel side.
I did a quick search, and I didn't find any other fs that requires
PMD-sized huge pages, so I am not sure if FADV_* is useful for
filesystems other than tmpfs. Please correct me if I missed something.
What do you mean by "require"? THPs are always opportunistic.
IIUC, we don't have a way to hint the kernel to use huge pages for a
file on read from backing storage. Readahead is not always the right
way.
If huge= is not set, tmpfs would behave the same way as the rest of the
filesystems.
So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate
large folios based on the write size? If yes, that means it will change
the default huge behavior for tmpfs, because previously leaving 'huge='
unset meant the huge option was 'SHMEM_HUGE_NEVER', which is similar to
what I mentioned:

"Another possible choice is to make the huge pages allocation based on
write size as the *default* behavior for tmpfs, ..."
I am more worried about breaking existing users of huge pages. So
changing the behaviour of users who don't specify huge= is okay to me.
I think moving tmpfs to allocate large folios opportunistically by
default (as it was proposed initially) doesn't necessarily conflict
with the default behaviour (huge=never). We just need to clarify that
in the documentation.

However, and IIRC, one of the requests from Hugh was to have a way to
disable large folios, which is something other filesystems do not have
control over as of today. Ryan sent a proposal to actually control that
globally, but I think it didn't move forward. So, what are we missing
to go back to implementing large folios in tmpfs in the default case,
as any other fs leveraging large folios?
IMHO, as I discussed with Kirill, we still need to maintain
compatibility with the 'huge=' mount option. This means that if
'huge=never' is set for tmpfs, huge page allocation will still be
prohibited (which can address Hugh's request?). However, if 'huge=' is
not set, we can allocate large folios based on the write size.
I consider allocating large folios in shmem/tmpfs on the write path
less controversial than allocating them on the page fault path --
especially as long as we stay within the size to-be-written.
I think in RHEL, THP on shmem/tmpfs is disabled by default (e.g.,
shmem_enabled=never), maybe because of some rather undesired
side-effects (maybe some are historical?): I recall issues with VMs
with THP + memory ballooning, as we cannot reclaim the pages of a folio
if splitting fails. I assume most of these problematic use cases don't
use tmpfs as an ordinary file system (write()/read()), but mmap() the
whole thing.
Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
documentation; most documentation is only concerned about anon THP.
Which makes me conclude that they are not suggested as of now.
I see more issues with allocating them on the page fault path and not
having a way to disable it -- compared to allocating them on the
write() path.
I may not understand your issues. IIUC, you can disable allocating huge
pages on the page fault path by using the 'huge=never' mount option or
by setting shmem_enabled=deny. No?
That's what I am saying: if there is some way to disable it that will
keep working, great.
--
Cheers,
David / dhildenb