On 29.03.22 22:42, Khalid Aziz wrote:
> On 3/29/22 07:59, David Hildenbrand wrote:
>> On 15.03.22 11:47, David Hildenbrand wrote:
>>> In case arch_unmap_one() fails, we already did a swap_duplicate(). Let's
>>> undo that properly via swap_free().
>>>
>>> Fixes: ca827d55ebaa ("mm, swap: Add infrastructure for saving page metadata on swap")
>>> Reviewed-by: Khalid Aziz <khalid.aziz@xxxxxxxxxx>
>>> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
>>> ---
>>>   mm/rmap.c | 1 +
>>>   1 file changed, 1 insertion(+)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 6a1e8c7f6213..f825aeef61ca 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1625,6 +1625,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>  			break;
>>>  		}
>>>  		if (arch_unmap_one(mm, vma, address, pteval) < 0) {
>>> +			swap_free(entry);
>>>  			set_pte_at(mm, address, pvmw.pte, pteval);
>>>  			ret = false;
>>>  			page_vma_mapped_walk_done(&pvmw);
>>
>> Hi Khalid,
>>
>> I'm a bit confused about the semantics of arch_unmap_one(), I hope you can clarify.
>>
>>
>> See patch #11 in this series, where we can fail unmapping after arch_unmap_one() succeeded. E.g.,
>>
>> @@ -1623,6 +1634,24 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>  			page_vma_mapped_walk_done(&pvmw);
>>  			break;
>>  		}
>> +		if (anon_exclusive &&
>> +		    page_try_share_anon_rmap(subpage)) {
>> +			swap_free(entry);
>> +			set_pte_at(mm, address, pvmw.pte, pteval);
>> +			ret = false;
>> +			page_vma_mapped_walk_done(&pvmw);
>> +			break;
>> +		}
>> +		/*
>> +		 * Note: We don't remember yet if the page was mapped
>> +		 * exclusively in the swap entry, so swapin code has
>> +		 * to re-determine that manually and might detect the
>> +		 * page as possibly shared, for example, if there are
>> +		 * other references on the page or if the page is under
>> +		 * writeback. We made sure that there are no GUP pins
>> +		 * on the page that would rely on it, so for GUP pins
>> +		 * this is fine.
>> +		 */
>>  		if (list_empty(&mm->mmlist)) {
>>  			spin_lock(&mmlist_lock);
>>  			if (list_empty(&mm->mmlist))
>>
>>
>> For now, I was under the impression that we don't have to undo anything after
>> arch_unmap_one() succeeded, because we seem to not do anything for the two
>> cases below. But looking into arch_unmap_one() and how it allocates stuff I do
>> wonder what we would actually want to do here -- I'd assume we'd want to
>> trigger the del_tag_store() somehow?
>
> Hi David,
>

Hi,

thanks for your fast reply.

> Currently, once arch_unmap_one() completes successfully, we are at the point of no
> return for this pte. It will be replaced by a swap pte soon thereafter. Patch 11 adds
> another case where we may return without replacing the current pte with a swap pte.
> For now, could you resolve this by moving the above code block in patch 11 to before
> the call to arch_unmap_one()? That still leaves open the issue of having the
> flexibility to undo what arch_unmap_one() does for some other reason in the future.
> That will require coming up with a properly architected way to do it.

I really want clearing PG_anon_exclusive to be the last action, without
eventually having to set it again and overcomplicating PG_anon_exclusive/rmap
handling.

Ideally, we'd have an "arch_remap_one()" that just reverts what
arch_unmap_one() did.

>
>>
>> arch_unmap_one() calls adi_save_tags(), which allocates memory.
>> adi_restore_tags()->del_tag_store() reverts that operation and ends up
>> freeing memory conditionally; however, it's only
>> called from arch_do_swap_page().
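
Something like the following is what I have in mind -- just a rough sketch to
illustrate the idea, the hook name and the generic stub are made up and
untested: a no-op arch_remap_one() in the generic header that sparc could
override to call adi_restore_tags() and free the tag storage again:

#ifndef __HAVE_ARCH_REMAP_ONE
/*
 * Hypothetical counterpart to arch_unmap_one(): undo whatever it did for
 * this pte (e.g., free per-pte metadata it allocated). No-op by default.
 */
static inline void arch_remap_one(struct mm_struct *mm,
				  struct vm_area_struct *vma,
				  unsigned long addr, pte_t oldpte)
{
}
#endif

The failure paths in try_to_unmap_one()/try_to_migrate_one() that trigger
after a successful arch_unmap_one() would then call it before restoring the
original pte, e.g.:

		if (anon_exclusive && page_try_share_anon_rmap(subpage)) {
			arch_remap_one(mm, vma, address, pteval);
			swap_free(entry);
			set_pte_at(mm, address, pvmw.pte, pteval);
			ret = false;
			page_vma_mapped_walk_done(&pvmw);
			break;
		}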
>>
>>
>> Here is why I have to scratch my head:
>>
>> a) arch_do_swap_page() is only called from do_swap_page(). We don't do anything similar
>> for mm/swapfile.c:unuse_pte(), aren't we missing something?
>
> My understanding of this code path may be flawed, so do correct me if this does not
> sound right. unuse_pte() is called when the user turns off swap on a device. unuse_pte()
> is called by unuse_pte_range(), which swaps the page back in from the swap device
> before calling unuse_pte(). Once the page is read back in from swap, ultimately an
> access to the va for the page will result in a call to __handle_mm_fault(), which in
> turn will call handle_pte_fault() to insert a new pte for this mapping, and
> handle_pte_fault() will call arch_do_swap_page(), which will restore the tags.

unuse_pte() will replace a swap pte directly by a proper, present pte, just
like do_swap_page() would. You won't end up in do_swap_page() anymore and
arch_do_swap_page() won't be called, because there is no swap PTE anymore.

>
>>
>> b) try_to_migrate_one() does the arch_unmap_one(), but who will do the
>> restore+free after migration succeeded or failed, aren't we missing something?
>
> try_to_migrate_one() replaces the current pte with a migration pte after calling
> arch_unmap_one(). This causes __handle_mm_fault() to be called when a reference to
> the va covered by the migration pte is made. This will in turn finally result in a
> call to arch_do_swap_page(), which restores the tags.

Migration PTEs are restored via mm/migrate.c:remove_migration_ptes().
arch_do_swap_page() won't be called.

What you describe is the case where someone accesses the migration PTE while
migration is still active and the migration PTEs have not been removed yet.
While we'll end up in do_swap_page(), we'll do a migration_entry_wait(),
followed by an effectively immediate "return 0;". arch_do_swap_page() won't
get called.


So in essence, I think this doesn't work as expected yet. In the best case we
don't immediately free memory. In the worst case we lose the tags.

-- 
Thanks,

David / dhildenb
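
P.S.: The early bail-out I mean is the one at the beginning of do_swap_page();
roughly (simplified and abbreviated here, so don't take the exact shape
literally):

	entry = pte_to_swp_entry(vmf->orig_pte);
	if (unlikely(non_swap_entry(entry))) {
		if (is_migration_entry(entry)) {
			/* Wait for migration to finish; the fault gets retried later. */
			migration_entry_wait(vma->vm_mm, vmf->pmd, vmf->address);
		}
		...
		goto out;	/* ret is still 0, arch_do_swap_page() is never reached */
	}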