Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()

Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> · Tue, 25 Feb 2025 18:15:47 +0800

On 2025/2/25 17:20, Baolin Wang wrote:


On 2025/2/21 10:38, Zi Yan wrote:
On 20 Feb 2025, at 21:33, Zi Yan wrote:

On 20 Feb 2025, at 8:06, Zi Yan wrote:

On 20 Feb 2025, at 4:27, Baolin Wang wrote:

On 2025/2/20 17:07, Baolin Wang wrote:


On 2025/2/20 00:10, Zi Yan wrote:
On 19 Feb 2025, at 5:04, Baolin Wang wrote:

Hi Zi,

Sorry for the late reply due to being busy with other things:)

Thank you for taking a look at the patches. :)


On 2025/2/19 07:54, Zi Yan wrote:
During shmem_split_large_entry(), large swap entries are 
covering n slots
and an order-0 folio needs to be inserted.

Instead of splitting all n slots, only the 1 slot covered by 
the folio
need to be split and the remaining n-1 shadow entries can be 
retained with
orders ranging from 0 to n-1.  This method only requires
(n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
(n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
xas_split_alloc() + xas_split() one.

For example, to split an order-9 large swap entry (assuming 
XA_CHUNK_SHIFT
is 6), 1 xa_node is needed instead of 8.

xas_try_split_min_order() is used to reduce the number of calls to
xas_try_split() during split.

For shmem swapin, if we cannot swap in the whole large folio by 
skipping the swap cache, we will split the large swap entry 
stored in the shmem mapping into order-0 swap entries, rather 
than splitting it into other orders of swap entries. This is 
because the next time we swap in a shmem folio through 
shmem_swapin_cluster(), it will still be an order 0 folio.

Right. But the swapin is one folio at a time, right? 
shmem_split_large_entry()

Yes, now we always swapin an order-0 folio from the async swap 
device at a time. However, for sync swap device, we will skip the 
swapcache and swapin the whole large folio by commit 1dd44c0af4fa, 
so it will not call shmem_split_large_entry() in this case.

Got it. I will check the commit.


should split the large swap entry and give you a slot to store 
the order-0 folio.
For example, with an order-9 large swap entry, to swap in first 
order-0 folio,
the large swap entry will become order-0, order-0, order-1, 
order-2,… order-8,
after the split. Then the first order-0 swap entry can be used.
Then, when a second order-0 is swapped in, the second order-0 can 
be used.
When the last order-0 is swapped in, the order-8 would be split to
order-7,order-6,…,order-1,order-0, order-0, and the last order-0 
will be used.

Yes, understood. However, for the sequential swapin scenarios, 
where originally only one split operation is needed. However, your 
approach increases the number of split operations. Of course, I 
understand that in non-sequential swapin scenarios, your patch 
will save some xarray memory. It might be necessary to evaluate 
whether the increased split operations will have a significant 
impact on the performance of sequential swapin?

Is there a shmem swapin test I can run to measure this? 
xas_try_split() should
performance similar operations as existing 
xas_split_alloc()+xas_split().


Maybe the swapin assumes after shmem_split_large_entry(), all 
swap entries
are order-0, which can lead to issues. There should be some check 
like
if the swap entry order > folio_order, shmem_split_large_entry() 
should
be used.

Moreover I did a quick test with swapping in order 6 shmem 
folios, however, my test hung, and the console was continuously 
filled with the following information. It seems there are some 
issues with shmem swapin handling. Anyway, I need more time to 
debug and test.
To swap in order-6 folios, shmem_split_large_entry() does not 
allocate
any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to see OOM
error below. Let me know if there is anything I can help.

I encountered some issues while testing order 4 and order 6 swapin 
with your patches. And I roughly reviewed the patch, and it seems 
that the new swap entry stored in the shmem mapping was not 
correctly updated after the split.

The following logic is to reset the swap entry after split, and I 
assume that the large swap entry is always split to order 0 
before. As your patch suggests, if a non-uniform split is used, 
then the logic for resetting the swap entry needs to be changed? 
Please correct me if I missed something.

/*
   * Re-set the swap entry after splitting, and the swap
   * offset of the original large entry must be continuous.
   */
for (i = 0; i < 1 << order; i++) {
      pgoff_t aligned_index = round_down(index, 1 << order);
      swp_entry_t tmp;

      tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
      __xa_store(&mapping->i_pages, aligned_index + i,
             swp_to_radix_entry(tmp), 0);
}

Right. I will need to adjust swp_entry_t. Thanks for pointing this out.


In addition, after your patch, the shmem_split_large_entry() seems 
always return 0 even though it splits a large swap entry, but we 
still need re-calculate the swap entry value after splitting, 
otherwise it may return errors due to shmem_confirm_swap() 
validation failure.

/*
  * If the large swap entry has already been split, it is
  * necessary to recalculate the new swap entry based on
  * the old order alignment.
  */
  if (split_order > 0) {
    pgoff_t offset = index - round_down(index, 1 << split_order);

    swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
}

Got it. I will fix it.

BTW, do you mind sharing your swapin tests so that I can test my new 
version
properly?

The diff below adjusts the swp_entry_t and returns the right order after
shmem_split_large_entry(). Let me know if it fixes your issue.

Fixed the compilation error. It will be great if you can share a 
swapin test, so that
I can test locally. Thanks.

diff --git a/mm/shmem.c b/mm/shmem.c
index b35ba250c53d..bfc4ef511391 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2162,7 +2162,7 @@ static int shmem_split_large_entry(struct inode 
*inode, pgoff_t index,
  {
      struct address_space *mapping = inode->i_mapping;
      XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
-    int split_order = 0;
+    int split_order = 0, entry_order = 0;
      int i;

      /* Convert user data gfp flags to xarray node gfp flags */
@@ -2180,6 +2180,7 @@ static int shmem_split_large_entry(struct inode 
*inode, pgoff_t index,
          }

          order = xas_get_order(&xas);
+        entry_order = order;

          /* Try to split large swap entry in pagecache */
          if (order > 0) {
@@ -2192,23 +2193,23 @@ static int shmem_split_large_entry(struct 
inode *inode, pgoff_t index,
                  xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
                  if (xas_error(&xas))
                      goto unlock;
+
+                /*
+                 * Re-set the swap entry after splitting, and the swap
+                 * offset of the original large entry must be 
continuous.
+                 */
+                for (i = 0; i < 1 << cur_order; i += (1 << 
split_order)) {
+                    pgoff_t aligned_index = round_down(index, 1 << 
cur_order);
+                    swp_entry_t tmp;
+
+                    tmp = swp_entry(swp_type(swap), swp_offset(swap) 
+ i);
+                    __xa_store(&mapping->i_pages, aligned_index + i,
+                           swp_to_radix_entry(tmp), 0);
+                }
                  cur_order = split_order;
                  split_order =
                      xas_try_split_min_order(split_order);
              }

This looks incorrect to me. Suppose we are splitting an order-9 swap 
entry, in the first iteration of the loop, it splits the order-9 swap 
entry into 8 order-6 swap entries. At this point, the order-6 swap 
entries are reset, and everything seems fine.

However, in the second iteration, where an order-6 swap entry is split 
into 63 order-0 swap entries, the split operation itself is correct. But 

typo: 64

when resetting the order-0 swap entry, it seems incorrect. Now the 
'cur_order' = 6 and 'split_order' = 0, which means the range for the 
reset index is always between 0 and 63 (see __xa_store()).

Sorry for confusing. The 'aligned_index' will be rounded down by 
'cur_order' (which is 6), so the index is correct. But the swap offset 
calculated by 'swp_offset(swap) + i' looks incorrect, cause the 'i' is 
always between 0 and 63.

 > +for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
 > +    pgoff_t aligned_index = round_down(index, 1 << cur_order);
 > +    swp_entry_t tmp;
 > +
 > +    tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
 > +    __xa_store(&mapping->i_pages, aligned_index + i,
 > +        swp_to_radix_entry(tmp), 0);
 > +}

However, if the index is greater than 63, it appears that it is not set 
to the correct new swap entry after splitting. Please corect me if I 
missed anyting.

-
-            /*
-             * Re-set the swap entry after splitting, and the swap
-             * offset of the original large entry must be continuous.
-             */
-            for (i = 0; i < 1 << order; i++) {
-                pgoff_t aligned_index = round_down(index, 1 << order);
-                swp_entry_t tmp;
-
-                tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
-                __xa_store(&mapping->i_pages, aligned_index + i,
-                       swp_to_radix_entry(tmp), 0);
-            }
          }

  unlock:
@@ -2221,7 +2222,7 @@ static int shmem_split_large_entry(struct inode 
*inode, pgoff_t index,
      if (xas_error(&xas))
          return xas_error(&xas);

-    return split_order;
+    return entry_order;
  }

  /*


Best Regards,
Yan, Zi