Re: [RFC PATCH v3 0/4] Support large folios for tmpfs

On 05.11.24 13:45, Baolin Wang wrote:


On 2024/10/31 18:46, David Hildenbrand wrote:
[snip]

I don't like that:

(a) there is no way to explicitly enable/name that new behavior.

But this is similar to other file systems that enable large folios
(by calling mapping_set_large_folios()), and I haven't seen any other
file system that supports large folios require a new Kconfig. Maybe
tmpfs is a bit special?

I'm afraid I don't have the energy to explain once more why I think
tmpfs is not just like any other file system in some cases.

And distributions are rather careful when it comes to something like
this ...


If we all agree that tmpfs is a bit special when using huge pages, then
fine, a Kconfig option might be needed.

(b) "always" etc. are only concerned about PMDs.

Yes, this currently maintains the same semantics as before, in case
users still expect THPs.

Again, I don't think that is a reasonable approach to make PMD-sized
ones special here. It will all get seriously confusing and inconsistent.

I agree that PMD-sized folios should not be special. This is all for
backward compatibility with the ‘huge=’ mount option, and adding a new
Kconfig is also for this purpose.

THPs are opportunistic after all, and page fault behavior will remain
unchanged (PMD-sized) for now. And even if we support other sizes
during page faults, we'd like to start with the largest size
(PMD-size) first, and it might just all work better than before.

Happy to learn where this really makes a difference.

Of course, if you change the default behavior (which you are planning),
it's ... a changed default.

If there are reasons to have more tunables regarding the sizes to use,
then it should not be limited to PMD-size.

I have tried to modify the code according to your suggestion (not
tested yet). Is this what you had in mind?

static inline unsigned int
shmem_mapping_size_order(struct address_space *mapping, pgoff_t index,
                         loff_t write_end)
{
          unsigned int order;
          size_t size;

          if (!mapping_large_folio_support(mapping) || !write_end)
                  return 0;

          /* Calculate the write size based on the write_end */
          size = write_end - (index << PAGE_SHIFT);
          order = filemap_get_order(size);
          if (!order)
                  return 0;

          /* If we're not aligned, allocate a smaller folio */
          if (index & ((1UL << order) - 1))
                  order = __ffs(index);

          order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
          return order > 0 ? BIT(order + 1) - 1 : 0;
}
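
(For illustration only: a minimal user-space sketch of the order-mask
calculation above. PAGE_SHIFT, MAX_PAGECACHE_ORDER and the
filemap_get_order() stand-in are assumptions for the example, not the
kernel implementation.)

#include <stdio.h>
#include <stddef.h>

#define PAGE_SHIFT              12
#define PAGE_SIZE               (1UL << PAGE_SHIFT)
#define MAX_PAGECACHE_ORDER     9       /* assumed PMD order for the example */

/* Rough stand-in for filemap_get_order(): largest order fully covered by size */
static unsigned int size_to_order(size_t size)
{
        unsigned int order = 0;

        while (order < MAX_PAGECACHE_ORDER && (PAGE_SIZE << (order + 1)) <= size)
                order++;
        return order;
}

/* Mirror of the logic above: bitmap of candidate orders 0..order */
static unsigned long write_size_order_mask(unsigned long index, long long write_end)
{
        size_t size = write_end - (index << PAGE_SHIFT);
        unsigned int order = size_to_order(size);

        if (!order)
                return 0;
        /* If index is not aligned to the order, fall back to its alignment */
        if (index & ((1UL << order) - 1))
                order = __builtin_ctzl(index);
        if (order > MAX_PAGECACHE_ORDER)
                order = MAX_PAGECACHE_ORDER;
        return order > 0 ? (1UL << (order + 1)) - 1 : 0;
}

int main(void)
{
        /* 2M write at index 0 -> candidate orders 0..9 (mask 0x3ff) */
        printf("2M write:     0x%lx\n", write_size_order_mask(0, 2LL << 20));
        /* 64K write at index 0 -> candidate orders 0..4 (mask 0x1f) */
        printf("64K write:    0x%lx\n", write_size_order_mask(0, 64LL << 10));
        /* sub-4K write -> order 0, i.e. no large folio (mask 0) */
        printf("1-byte write: 0x%lx\n", write_size_order_mask(0, 1));
        return 0;
}

Note that a write smaller than one page maps to order 0 and thus to no
large folio at all, which is the 'huge=always' concern raised further
below.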

static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
                                              loff_t write_end, bool shmem_huge_force,
                                              unsigned long vm_flags)
{
          bool is_shmem = inode->i_sb == shm_mnt->mnt_sb;
          unsigned long within_size_orders;
          unsigned int order;
          loff_t i_size;

          if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
                  return 0;
          if (!S_ISREG(inode->i_mode))
                  return 0;
          if (shmem_huge == SHMEM_HUGE_DENY)
                  return 0;
          if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
                  return BIT(HPAGE_PMD_ORDER);

          switch (SHMEM_SB(inode->i_sb)->huge) {
          case SHMEM_HUGE_NEVER:
                  return 0;
          case SHMEM_HUGE_ALWAYS:
                  if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
                          return BIT(HPAGE_PMD_ORDER);

                  return shmem_mapping_size_order(inode->i_mapping, index,
                                                  write_end);
          case SHMEM_HUGE_WITHIN_SIZE:
                  if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
                          within_size_orders = BIT(HPAGE_PMD_ORDER);
                  else
                          within_size_orders =
                                  shmem_mapping_size_order(inode->i_mapping,
                                                           index, write_end);

                  order = highest_order(within_size_orders);
                  while (within_size_orders) {
                          index = round_up(index + 1, 1 << order);
                          i_size = max(write_end, i_size_read(inode));
                          i_size = round_up(i_size, PAGE_SIZE);
                          if (i_size >> PAGE_SHIFT >= index)
                                  return within_size_orders;

                          order = next_order(&within_size_orders, order);
                  }
                  fallthrough;
          case SHMEM_HUGE_ADVISE:
                  if (vm_flags & VM_HUGEPAGE) {
                          if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
                                  return BIT(HPAGE_PMD_ORDER);

                          return shmem_mapping_size_order(inode->i_mapping,
                                                          index, write_end);
                  }
                  fallthrough;
          default:
                  return 0;
          }
}
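
(Again only as an illustration: a user-space sketch of how the
within_size loop walks the candidate orders from highest to lowest.
highest_order()/next_order() are re-implemented here for the example
and are not the kernel helpers.)

#include <stdio.h>

/* Index of the highest set bit in the order bitmap */
static int highest_order(unsigned long orders)
{
        return 8 * sizeof(orders) - 1 - __builtin_clzl(orders);
}

/* Clear the previous order and pick the next highest remaining one */
static int next_order(unsigned long *orders, int prev)
{
        *orders &= ~(1UL << prev);
        return *orders ? highest_order(*orders) : 0;
}

int main(void)
{
        unsigned long orders = 0x3ff;   /* candidate orders 0..9 */
        int order = highest_order(orders);

        while (orders) {
                /* the kernel loop returns as soon as an order fits i_size */
                printf("trying order %d\n", order);
                order = next_order(&orders, order);
        }
        return 0;
}

In the kernel loop, iteration stops and the remaining order bitmap is
returned as soon as the rounded-up i_size covers the rounded-up index.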

1) Add a new 'CONFIG_USE_ONLY_THP_FOR_TMPFS' kconfig to keep ‘huge=’
mount option compatibility.
2) For tmpfs write(), if CONFIG_USE_ONLY_THP_FOR_TMPFS is not enabled,
we will get the possible huge orders based on the write size.
3) For tmpfs mmap() fault, always use PMD-sized huge order.
4) For shmem, ignore the write size logic and always use PMD-sized THP
to check whether the global huge setting is enabled.

However, in case 2), if 'huge=always' is set and the write size is less
than 4K, we will allocate small pages; does that break the 'huge'
semantics? Maybe it's not something to worry about too much.

Probably I didn't express clearly what I think we should do, because
this is not quite what I had in mind.

I would use the CONFIG_USE_ONLY_THP_FOR_TMPFS way of doing it only if
really required. As raised, if someone needs finer control, providing that
only for a single size is rather limiting.



This is what I hope we can do (doc update to show what I mean):

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 5034915f4e8e8..d7d1a9acdbfc5 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -349,11 +349,24 @@ user, the PMD_ORDER hugepage policy will be overridden. If the policy for
 PMD_ORDER is not defined within a valid ``thp_shmem``, its policy will
 default to ``never``.
-Hugepages in tmpfs/shmem
-========================
+tmpfs/shmem
+===========
-You can control hugepage allocation policy in tmpfs with mount option
-``huge=``. It can have following values:
+Traditionally, tmpfs only supported a single huge page size ("PMD"). Today,
+it also supports smaller sizes just like anonymous memory, often referred
+to as "multi-size THP" (mTHP). Huge pages of any size are commonly
+represented in the kernel as "large folios".
+
+While there is fine control over the huge page sizes to use for the internal
+shmem mount (see below), ordinary tmpfs mounts will make use of all
+available huge page sizes without any control over the exact sizes,
+behaving more like other file systems.
+
+tmpfs mounts
+------------
+
+The THP allocation policy for tmpfs mounts can be adjusted using the mount
+option: ``huge=``. It can have following values:
always
     Attempt to allocate huge pages every time we need a new page;
@@ -368,19 +381,20 @@ within_size
 advise
     Only allocate huge pages if requested with fadvise()/madvise();
-The default policy is ``never``.
+Remember that the kernel may use huge pages of all available sizes, and
+that no fine-grained control (as for the internal tmpfs mount) is available.
+
+The default policy in the past was ``never``, but it can now be adjusted
+using the CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_ALWAYS,
+CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_NEVER etc.
``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
 ``huge=never`` will not attempt to break up huge pages at all, just stop more
 from being allocated.
-There's also sysfs knob to control hugepage allocation policy for internal
-shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
-is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
-MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
-
-In addition to policies listed above, shmem_enabled allows two further
-values:
+In addition to policies listed above, the sysfs knob
+/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the
+allocation policy of tmpfs mounts, when set to the following values:
deny
     For use in emergencies, to force the huge option off from
@@ -388,13 +402,26 @@ deny
 force
     Force the huge option on for all - very useful for testing;
-Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to
-control mTHP allocation:
-'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
-and its value for each mTHP is essentially consistent with the global
-setting.  An 'inherit' option is added to ensure compatibility with these
-global settings.  Conversely, the options 'force' and 'deny' are dropped,
-which are rather testing artifacts from the old ages.
+
+shmem / internal tmpfs
+----------------------
+
+The internal tmpfs mount is used for SysV SHM, memfds, shared anonymous
+mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
+
+To control the THP allocation policy for this internal tmpfs mount, the
+sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs
+per THP size in
+'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'
+can be used.
+
+The global knob has the same semantics as the ``huge=`` mount options
+for tmpfs mounts, except that the different huge page sizes can be controlled
+individually, and will only use the setting of the global knob when the
+per-size knob is set to 'inherit'.
+
+The options 'force' and 'deny' are dropped for the individual sizes, which
+are rather testing artifacts from the old ages.
always
     Attempt to allocate <size> huge pages every time we need a new page;
diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 56a26c843dbe9..10de8f706d07b 100644
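
(As an aside, not part of the patch: the same mount option can of
course also be set programmatically when testing. A minimal sketch,
with a made-up mount point, requiring CAP_SYS_ADMIN.)

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* Mount a 1G tmpfs with the within_size THP policy */
        if (mount("tmpfs", "/mnt/test-tmpfs", "tmpfs", 0,
                  "size=1G,huge=within_size") != 0) {
                perror("mount");
                return 1;
        }
        printf("tmpfs mounted with huge=within_size\n");
        return 0;
}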



There is this question of "do we need the old way of doing it and only
allocate PMDs". For that, a config similar to the one you propose might
make sense, but I would want to see if there is real demand for it. In
particular: for whom are the smaller sizes a problem when bigger (PMD)
sizes were enabled in the past?

--
Cheers,

David / dhildenb




