Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

David Hildenbrand <david@xxxxxxxxxx> · Fri, 7 Jun 2024 12:39:57 +0200




On 07.06.24 11:05, Daniel Gomez wrote:
On Thu, Jun 06, 2024 at 10:38:22AM +0200, David Hildenbrand wrote:
On 06.06.24 05:31, Baolin Wang wrote:


On 2024/6/4 20:05, Daniel Gomez wrote:
On Tue, Jun 04, 2024 at 05:45:20PM +0800, Baolin Wang wrote:


On 2024/6/4 16:18, Daniel Gomez wrote:
On Fri, May 31, 2024 at 01:13:48PM +0200, David Hildenbrand wrote:

As a default, we should not be using large folios / mTHP for any shmem,
just like we did with THP via shmem_enabled. This is what this series
currently does, and is aprt of the whole mTHP user-space interface design.

Further, the mTHP controls should control all of shmem, not only
"anonymous shmem".

Yes, that's what I thought and in my TODO list.

Good, it would be helpful to coordinate with Daniel and Pankaj.

I've integrated patches 11 and 12 from the lsf RFC thread [1] on top of Baolin's
v3 patches. You may find a version in my integration branch here [2]. I can
attach them here if it's preferred.

[1] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@xxxxxxxxxxx/
[2] https://protect2.fireeye.com/v1/url?k=a23e7c06-c3b56926-a23ff749-74fe485fb347-371ca2bfd5d9869f&q=1&e=6974304e-a786-4255-93a7-57498540241c&u=https%3A%2F%2Fgitlab.com%2Fdkruces%2Flinux-next%2F-%2Fcommits%2Fnext-20240604-shmem-mthp

The point here is to combine the large folios strategy I proposed with mTHP
user controls. Would it make sense to limit the orders to the mapping order
calculated based on the size and index?

IMO, for !anon shmem, this change makes sense to me. We should respect the
size and mTHP should act as a order filter.

What about respecing the size when within_size flag is enabled? Then, 'always'
would allocate mTHP enabled folios, regardless of the size. And 'never'
would ignore mTHP and size. So, 'never' can be used for this 'safe' boot case
mentioned in the discussion.

Looks reasonable to me. What do you think, David?


That mimics existing PMD-THP behavior, right?

With "within_size", we must not exceed the size, with "always", we may
exceed the size.

But right now we only check the inode size. With large folio support in
write_iter() we can have access to the length as well. I think this would solve
(paratially?) the cases where we don't have access to the file size and if we
perform writes in bigger chunks.

E.g. xfs_io -t -f -c "pwrite -b 2M -S 0x58 0 2M" /mnt/test/file

For 'within_size', the example above would allocate 512 pages instead of one
huge page. After patches [1] [2] we can get the size of the write to allocate
whatever mTHP/THP makes more sense for the length being passed.


Yes, although this sounds like an optimization/change we should be doing 
separately I guess.

[1] https://lore.kernel.org/all/20240527163616.1135968-2-hch@xxxxxx/
[2] https://lore.kernel.org/all/20240515055719.32577-12-da.gomez@xxxxxxxxxxx/

Here a quick hack for THP:

@@ -561,7 +561,8 @@ bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force,
         case SHMEM_HUGE_WITHIN_SIZE:
                 index = round_up(index + 1, HPAGE_PMD_NR);
                 i_size = round_up(i_size_read(inode), PAGE_SIZE);
-               if (i_size >> PAGE_SHIFT >= index)
+               if ((i_size >> PAGE_SHIFT >= index) ||
+                   (len >> PAGE_SHIFT >= index))
                         return true;
                 fallthrough;



And what about 'advise' option? Silimar to 'within_size'?

Good question. What's the behavior of PMD-THP? I would assume it behaves
like "within_size", because in the common case we mmap (+advise) only within
the size of the file, not exceeding it.

It allocates a huge page on request when MADV_HUGEPAGE (regardless of the size).


Interesting, so we should do the same. Thanks!

--
Cheers,

David / dhildenb