On 2024/8/26 05:55, Hugh Dickins wrote:
On Mon, 12 Aug 2024, Baolin Wang wrote:
In the following patches, shmem will support the swap out of large folios,
which means the shmem mappings may contain large order swap entries, so
using xa_get_order() to get the folio order of the shmem swap entry to
update the '*start' correctly.
Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
---
mm/filemap.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/filemap.c b/mm/filemap.c
index 4130be74f6fd..4c312aab8b1f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2056,6 +2056,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
folio = fbatch->folios[idx];
if (!xa_is_value(folio))
nr = folio_nr_pages(folio);
+ else
+ nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
*start = indices[idx] + nr;
}
return folio_batch_count(fbatch);
@@ -2120,6 +2122,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
folio = fbatch->folios[idx];
if (!xa_is_value(folio))
nr = folio_nr_pages(folio);
+ else
+ nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
*start = indices[idx] + nr;
}
return folio_batch_count(fbatch);
--
Here we have a problem, but I'm not suggesting a fix for it yet: I
need to get other fixes out first, then turn to thinking about this -
it's not easy.
Thanks for raising the issues.
That code is almost always right, so it works well enough for most
people not to have noticed, but there are two issues with it.
The first issue is that it's assuming indices[idx] is already aligned
to nr: not necessarily so. I believe it was already wrong in the
folio_nr_pages() case, but the time I caught it wrong with a printk
was in the swap (value) case. (I may not be stating this correctly:
again more thought needed but I can't spend so long writing.)
And immediately afterwards that kernel build failed with a corrupted
(all 0s) .o file - I'm building on ext4 on /dev/loop0 on huge tmpfs while
swapping, and happen to be using the "-o discard" option to ext4 mount.
I've been chasing these failures (maybe a few minutes in, maybe half an
hour) for days, then had the idea of trying without the "-o discard":
whereupon these builds can be repeated successfully for many hours.
IIRC ext4 discard to /dev/loop0 entails hole-punch to the tmpfs.
The alignment issue can easily be corrected, but that might not help.
(And those two functions would look better with the rcu_read_unlock()
moved down to just before returning: but again, wouldn't really help.)
The second issue is that swap is more slippery to work with than
folios or pages: in the folio_nr_pages() case, that number is stable
because we hold a refcount (which stops a THP from being split), and
later we'll be taking folio lock too. None of that in the swap case,
so (depending on how a large entry gets split) the xa_get_order() result
is not reliable. Likewise for other uses of xa_get_order() in this series.
Now we have 2 users of xa_get_order() in this series:
1) shmem_partial_swap_usage(): this is acceptable, since racy results
are not a problem for the swap statistics.
2) shmem_undo_range(): when freeing a large swap entry, it will use
xa_cmpxchg_irq() to make sure the swap value is not changed (in case the
large swap entry is split). If failed to cmpxchg, then it will use
current index to retry in shmem_undo_range(). So seems not too bad?
static long shmem_free_swap(struct address_space *mapping,
pgoff_t index, void *radswap)
{
int order = xa_get_order(&mapping->i_pages, index);
void *old;
old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
if (old != radswap)
return 0;
free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
return 1 << order;
}
There needs to be some kind of locking or retry to make the order usable,
and to avoid shmem_free_swap() occasionally freeing more than it ought.
I'll give it thought after.