Re: [RFC PATCH v3 0/4] Support large folios for tmpfs

Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> · Tue, 5 Nov 2024 20:45:01 +0800

On 2024/10/31 18:46, David Hildenbrand wrote:
[snip]

I don't like that:

(a) there is no way to explicitly enable/name that new behavior.

But this is similar to other file systems that enable large folios
(setting mapping_set_large_folios()), and I haven't seen any other file
systems supporting large folios requiring a new Kconfig. Maybe tmpfs is
a bit special?

I'm afraid I don't have the energy to explain once more why I think 
tmpfs is not just like any other file system in some cases.

And distributions are rather careful when it comes to something like 
this ...

If we all agree that tmpfs is a bit special when using huge pages, then
fine, a Kconfig option might be needed.

(b) "always" etc. are only concerned about PMDs.

Yes, currently maintain the same semantics as before, in case users
still expect THPs.

Again, I don't think that is a reasonable approach to make PMD-sized 
ones special here. It will all get seriously confusing and inconsistent.

I agree PMD-sized should not be special. This is all for backward 
compatibility with the ‘huge=’ mount option, and adding a new kconfig is 
also for this purpose.

THPs are opportunistic after all, and page fault behavior will remain 
unchanged (PMD-sized) for now. And even if we support other sizes during 
page faults, we'd like start with the largest size (PMD-size) first, and 
it likely might just all work better than before.

Happy to learn where this really makes a difference.

Of course, if you change the default behavior (which you are planning), 
it's ... a changed default.

If there are reasons to have more tunables regarding the sizes to use, 
then it should not be limited to PMD-size.

I have tried to modify the code according to your suggestion (not tested 
yet). These are what you had in mind?

static inline unsigned int
shmem_mapping_size_order(struct address_space *mapping, pgoff_t index, 
loff_t write_end)
{
        unsigned int order;
        size_t size;

        if (!mapping_large_folio_support(mapping) || !write_end)
                return 0;

        /* Calculate the write size based on the write_end */
        size = write_end - (index << PAGE_SHIFT);
        order = filemap_get_order(size);
        if (!order)
                return 0;

        /* If we're not aligned, allocate a smaller folio */
        if (index & ((1UL << order) - 1))
                order = __ffs(index);

        order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
        return order > 0 ? BIT(order + 1) - 1 : 0;
}

static unsigned int shmem_huge_global_enabled(struct inode *inode, 
pgoff_t index,
                                              loff_t write_end, bool 
shmem_huge_force,
                                              unsigned long vm_flags)
{
        bool is_shmem = inode->i_sb == shm_mnt->mnt_sb;
        unsigned long within_size_orders;
        unsigned int order;
        loff_t i_size;

        if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
                return 0;
        if (!S_ISREG(inode->i_mode))
                return 0;
        if (shmem_huge == SHMEM_HUGE_DENY)
                return 0;
        if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
                return BIT(HPAGE_PMD_ORDER);

        switch (SHMEM_SB(inode->i_sb)->huge) {
        case SHMEM_HUGE_NEVER:
                return 0;
        case SHMEM_HUGE_ALWAYS:
                if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
                        return BIT(HPAGE_PMD_ORDER);

                return shmem_mapping_size_order(inode->i_mapping, 
index, write_end);
        case SHMEM_HUGE_WITHIN_SIZE:
                if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
                        within_size_orders = BIT(HPAGE_PMD_ORDER);
                else
                        within_size_orders = 
shmem_mapping_size_order(inode->i_mapping,

index, write_end);

                order = highest_order(within_size_orders);
                while (within_size_orders) {
                        index = round_up(index + 1, 1 << order);
                        i_size = max(write_end, i_size_read(inode));
                        i_size = round_up(i_size, PAGE_SIZE);
                        if (i_size >> PAGE_SHIFT >= index)
                                return within_size_orders;

                        order = next_order(&within_size_orders, order);
                }
                fallthrough;
        case SHMEM_HUGE_ADVISE:
                if (vm_flags & VM_HUGEPAGE) {
                        if (is_shmem || IS_ENABLED(USE_ONLY_THP_FOR_TMPFS))
                                return BIT(HPAGE_PMD_ORDER);

                        return shmem_mapping_size_order(inode->i_mapping,
                                                        index, write_end);
                }
                fallthrough;
        default:
                return 0;
        }
}

1) Add a new 'CONFIG_USE_ONLY_THP_FOR_TMPFS' kconfig to keep ‘huge=’ 
mount option compatibility.
2) For tmpfs write(), if CONFIG_USE_ONLY_THP_FOR_TMPFS is not enabled, 
then will get the possible huge orders based on the write size.
3) For tmpfs mmap() fault, always use PMD-sized huge order.
4) For shmem, ignore the write size logic and always use PMD-sized THP 
to check if the global huge is enabled.

However, in case 2), if 'huge=always' and write size is less than 4K, so 
we will allocate small pages, that will break the 'huge' semantics? 
Maybe it's not something to worry too much about.

huge=never: No THPs of any size
huge=always: THPs of any size
huge=fadvise: like "always" but only with fadvise/madvise
huge=within_size: like "fadvise" but respect i_size

"huge=" default depends on a Kconfig option.

With that we:

(1) Maximize the cases where we will use large folios of any sizes
      (which Willy cares about).
(2) Have a way to disable them completely (which I care about).
(3) Allow distros to keep the default unchanged.

Likely, for now we will only try allocating PMD-sized THPs during page
faults, and allocate different sizes only during write(). So the effect
for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be
completely unchanged even with "huge=always".

It will get more tricky once we change that behavior as well, but that's
something to likely figure out if it is a real problem at at different
day :)

I really preferred using the sysfs toggles (as discussed with Hugh in
the meeting back then), but I can also understand why we at least want
to try making tmpfs behave more like other file systems. But I'm a bit
more careful to not ignore the cases where it really isn't like any
other file system.

That's also my previous thought, but Matthew is strongly against that.
Let's step by step.

Yes, I understand his view as well.

But I won't blindly agree to the "tmpfs is just like any other file 
system" opinion :)

 > >> If we start making PMD-sized THPs special in any non-configurable 
way,
then we are effectively off *worse* than allowing to configure them
properly. So if someone voices "but we want only PMD-sized" ones, the
next one will say "but we only want cont-pte sized-ones" and then we
should provide an option to control the actual sizes to use differently,
in some way. But let's see if that is even required.

Yes, I agree. So what I am thinking is, the 'huge=' option should be
gradually deprecated in the future and eventually tmpfs can allocate any
size large folios as default.

Let's be realistic, it won't get removed any time soon. ;)

So changing "huge=always" etc. semantics to reflect our new size 
options, and then try changing the default (with the option for 
people/distros to have the old default) is a reasonable approach, at 
least to me.

I'm trying to stay open-minded here, but the proposal I heard so far is 
not particularly appealing.