On 22/04/2020 19:34, Dave Hansen wrote:
On 4/22/20 7:25 AM, Steven Price wrote:
Because shmem can swap pages back in without restoring
the userspace PTE, it is also necessary to add a hook for shmem.
I think the swap readahead code does this as well. It pulls the page
into the swap cache, but doesn't map it in.
The swap cache isn't a problem because the swap entry is still valid -
so the parallel tag cache is still associated with the page. shmem has
its own cache, so the tag data has to be transferred between the caches
(in this case stored back in the physical tag memory).
...
+static DEFINE_XARRAY(mte_pages);
+
+void *mte_allocate_tag_storage(void)
+{
+ /* tags granule is 16 bytes, 2 tags stored per byte */
+ return kmalloc(PAGE_SIZE / 16 / 2, GFP_KERNEL);
+}
Yikes, so this eats 2k of unmovable kernel memory per 64k of swap? This
is *probably* worth having its own slab just so the memory that's used
for it is less opaque. It could be pretty large. But, I guess if
you're worried about how much kernel memory this can eat, there's always
the swap cgroup controller to enforce limits.
Yes, this is probably something that will need tuning in the future. At
the moment we don't have much idea how much memory will have tags.
It's 'only' 2k per 64k of memory that has been mapped with PROT_MTE.
Obvious avenues for potential improvement are:
* A dedicated slab (as you suggest)
* Enabling swapping of the tags themselves
* Compressing the tags (there might be large runs of duplicated tag
values)
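For reference, the arithmetic behind the "2k per 64k" figure can be checked with a small userspace sketch. The granule size and the two-tags-per-byte packing come from the comment in the quoted patch; the macro names here are just illustrative stand-ins, not necessarily the kernel's own definitions:

```c
#include <stddef.h>

/* MTE assigns one 4-bit tag per 16-byte granule, so two tags pack
 * into each byte of the out-of-line tag storage buffer. */
#define MTE_GRANULE_SIZE	16
#define MTE_TAGS_PER_BYTE	2

/* Bytes of tag storage needed to save the tags of one page,
 * matching the PAGE_SIZE / 16 / 2 expression in the patch. */
static size_t mte_tag_storage_size(size_t page_size)
{
	return page_size / MTE_GRANULE_SIZE / MTE_TAGS_PER_BYTE;
}
```

With the usual arm64 page sizes this gives 128 bytes for 4K pages, 512 bytes for 16K pages and 2048 bytes for 64K pages - i.e. a fixed 1/32 overhead for tagged memory that gets swapped.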
This also *increases* the footprint of a page while it's in the swap
cache. That's at least temporarily a _bit_ counterproductive.
Indeed. It *may* be possible to store the tags back in the physical tag
memory while the page is in the swap cache, but this would require
hooking into the path where the swap cache is freed without write-back
of the page data.
I guess there aren't any nice alternatives, though. I would imagine
that it would be substantially more complicated to rig the swap code up
to write the tag along with the data. Or, to store the tag data
somewhere *it* can be reclaimed, like in a kernel-internal shmem file or
something.
Yes, I'm attempting a simple (hopefully 'obviously right')
implementation. When we actually have some real applications making use
of this then we can look at optimisations, which will also allow them
to be benchmarked meaningfully.
+void mte_free_tag_storage(char *storage)
+{
+ kfree(storage);
+}
+
+int mte_save_tags(struct page *page)
+{
+ void *tag_storage, *ret;
+
+ if (!test_bit(PG_mte_tagged, &page->flags))
+ return 0;
+
+ tag_storage = mte_allocate_tag_storage();
+ if (!tag_storage)
+ return -ENOMEM;
+
+ mte_save_page_tags(page_address(page), tag_storage);
+
+ ret = xa_store(&mte_pages, page_private(page), tag_storage, GFP_KERNEL);
This is indexing into the xarray with the swap entry.val established in
do_swap_page()? Might be nice to make a note where it came from.
Good point, I'll put a comment in
+ if (WARN(xa_is_err(ret), "Failed to store MTE tags")) {
+ mte_free_tag_storage(tag_storage);
+ return xa_err(ret);
+ } else if (ret) {
+ mte_free_tag_storage(ret);
Is there a missing "return ret;" here? Otherwise, it seems like this
might silently fail to save the page's tags. I'm not sure what
non-xa_is_err() codes get returned, but if there is one, it could end up
here.
No, perhaps a comment is needed! xa_store() will return the value that
used to be at that index (or NULL if there wasn't an entry). So if the
swap index is being reused, ret != NULL (and !xa_is_err(ret)), and in
this case we need to free the old storage to prevent a memory leak. So
in this path the tags have been successfully saved.
+ }
+
+ return 0;
+}
...
+void mte_sync_tags(pte_t *ptep, pte_t pte)
+{
+ struct page *page = pte_page(pte);
+ pte_t old_pte = READ_ONCE(*ptep);
+ swp_entry_t entry;
+
+ set_bit(PG_mte_tagged, &page->flags);
+
+ if (!is_swap_pte(old_pte))
+ return;
+
+ entry = pte_to_swp_entry(old_pte);
+ if (non_swap_entry(entry))
+ return;
+
+ mte_restore_tags(entry, page);
+}
Oh, here it is! This gets called when replacing a swap PTE with a
present PTE and restores the tags on swap-in.
Does this work for swap PTEs which were copied at fork()? I *think*
those might end up in here twice, once for the parent and another for
the child. If both read the page, both will fault and both will do a
set_pte() and end up in here. I don't think it will do any harm since
it will just set_bit(PG_mte_tagged) twice and restore the same tags
twice. But, it might be nice to call that out.
Yes, that's what should happen. Beyond adding another page flag (to
track whether the tags have been restored or not) I couldn't think of a
neat way of preventing this. And I believe restoring the tags twice
should be harmless.
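The fork() case being harmless comes down to the restore being idempotent: unpacking the same saved buffer into the same page twice leaves identical state. A toy model (the `restore_tags()` helper, the 4-granule "page" and the low-nibble-first packing order are all illustrative assumptions, not the kernel's mte_restore_tags()):

```c
#include <string.h>

#define GRANULES 4	/* toy page: 4 granules, 2 packed storage bytes */

/* Toy model of restoring tags: unpack two 4-bit tags per byte of
 * saved storage into a per-granule tag array. Pure function of its
 * inputs, so running it twice is a no-op the second time. */
static void restore_tags(unsigned char *page_tags,
			 const unsigned char *storage)
{
	for (int i = 0; i < GRANULES; i++)
		page_tags[i] = (storage[i / 2] >> ((i % 2) * 4)) & 0xf;
}
```

Both the parent's and the child's fault end up writing the same tag values, so the double restore only costs the redundant work, not correctness.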
This function is a bit light on comments. It might make sense, for
instance, to note that 'pte' is always a tagged PTE at this point.
Yes, I'll add some more comments here too.
Thanks for the review!
Steve