On 29.03.22 22:42, Khalid Aziz wrote:
> On 3/29/22 07:59, David Hildenbrand wrote:
>> On 15.03.22 11:47, David Hildenbrand wrote:
>>> In case arch_unmap_one() fails, we already did a swap_duplicate(). Let's
>>> undo that properly via swap_free().
>>>
>>> Fixes: ca827d55ebaa ("mm, swap: Add infrastructure for saving page metadata on swap")
>>> Reviewed-by: Khalid Aziz <khalid.aziz@xxxxxxxxxx>
>>> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
>>> ---
>>>   mm/rmap.c | 1 +
>>>   1 file changed, 1 insertion(+)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 6a1e8c7f6213..f825aeef61ca 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1625,6 +1625,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>  			break;
>>>  		}
>>>  		if (arch_unmap_one(mm, vma, address, pteval) < 0) {
>>> +			swap_free(entry);
>>>  			set_pte_at(mm, address, pvmw.pte, pteval);
>>>  			ret = false;
>>>  			page_vma_mapped_walk_done(&pvmw);
>>
>> Hi Khalid,
>>
>> I'm a bit confused about the semantics of arch_unmap_one(), I hope you can clarify.
>>
>>
>> See patch #11 in this series, where we can fail unmapping after arch_unmap_one() succeeded. E.g.,
>>
>> @@ -1623,6 +1634,24 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>  			page_vma_mapped_walk_done(&pvmw);
>>  			break;
>>  		}
>> +		if (anon_exclusive &&
>> +		    page_try_share_anon_rmap(subpage)) {
>> +			swap_free(entry);
>> +			set_pte_at(mm, address, pvmw.pte, pteval);
>> +			ret = false;
>> +			page_vma_mapped_walk_done(&pvmw);
>> +			break;
>> +		}
>> +		/*
>> +		 * Note: We don't remember yet if the page was mapped
>> +		 * exclusively in the swap entry, so swapin code has
>> +		 * to re-determine that manually and might detect the
>> +		 * page as possibly shared, for example, if there are
>> +		 * other references on the page or if the page is under
>> +		 * writeback. We made sure that there are no GUP pins
>> +		 * on the page that would rely on it, so for GUP pins
>> +		 * this is fine.
>> +		 */
>>  		if (list_empty(&mm->mmlist)) {
>>  			spin_lock(&mmlist_lock);
>>  			if (list_empty(&mm->mmlist))
>>
>>
>> For now, I was under the impression that we don't have to undo anything after
>> arch_unmap_one() succeeded, because we seem to not do anything for the two
>> cases below. But looking into arch_unmap_one() and how it allocates stuff I do
>> wonder what we would actually want to do here -- I'd assume we'd want to
>> trigger the del_tag_store() somehow?
>
> Hi David,
>

Hi,

thanks for your fast reply.

> Currently, once arch_unmap_one() completes successfully, we are at the point of no
> return for this pte. It will be replaced by a swap pte soon thereafter. Patch 11 adds
> another case where we may return without replacing the current pte with a swap pte.
> For now, could you resolve this by moving the above code block in patch 11 to before
> the call to arch_unmap_one()? That still leaves open the issue of having the
> flexibility to undo what arch_unmap_one() does for some other reason in the future.
> That will require coming up with a properly architected way to do it.

I really want clearing PG_anon_exclusive to be the last action, without
eventually having to set it again and overcomplicating PG_anon_exclusive/rmap
handling.

Ideally, we'd have an "arch_remap_one()" that just reverts what
arch_unmap_one() did.

>
>>
>> arch_unmap_one() calls adi_save_tags(), which allocates memory.
>> adi_restore_tags()->del_tag_store() reverts that operation and ends up
>> freeing memory conditionally; however, it's only
>> called from arch_do_swap_page().
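
Something like the following is what I have in mind -- just a rough sketch to
illustrate the idea, the hook name and the generic stub are made up and
untested: a no-op arch_remap_one() in the generic header that sparc could
override to call adi_restore_tags() and free the tag storage again:

#ifndef __HAVE_ARCH_REMAP_ONE
/*
 * Hypothetical counterpart to arch_unmap_one(): undo whatever it did for
 * this pte (e.g., free per-pte metadata it allocated). No-op by default.
 */
static inline void arch_remap_one(struct mm_struct *mm,
				  struct vm_area_struct *vma,
				  unsigned long addr, pte_t oldpte)
{
}
#endif

The failure paths in try_to_unmap_one()/try_to_migrate_one() that trigger
after a successful arch_unmap_one() would then call it before restoring the
original pte, e.g.:

		if (anon_exclusive && page_try_share_anon_rmap(subpage)) {
			arch_remap_one(mm, vma, address, pteval);
			swap_free(entry);
			set_pte_at(mm, address, pvmw.pte, pteval);
			ret = false;
			page_vma_mapped_walk_done(&pvmw);
			break;
		}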
>>
>>
>> Here is why I have to scratch my head:
>>
>> a) arch_do_swap_page() is only called from do_swap_page(). We don't do anything similar
>> for mm/swapfile.c:unuse_pte(), aren't we missing something?
>
> My understanding of this code path may be flawed, so do correct me if this does not
> sound right. unuse_pte() is called when the user turns off swap on a device. unuse_pte()
> is called by unuse_pte_range(), which swaps the page back in from the swap device
> before calling unuse_pte(). Once the page is read back in from swap, ultimately an
> access to the va for the page will result in a call to __handle_mm_fault(), which in
> turn will call handle_pte_fault() to insert a new pte for this mapping, and
> handle_pte_fault() will call arch_do_swap_page(), which will restore the tags.

unuse_pte() will replace a swap pte directly by a proper, present pte, just
like do_swap_page() would. You won't end up in do_swap_page() anymore and
arch_do_swap_page() won't be called, because there is no swap PTE anymore.

>
>>
>> b) try_to_migrate_one() does the arch_unmap_one(), but who will do the
>> restore+free after migration succeeded or failed, aren't we missing something?
>
> try_to_migrate_one() replaces the current pte with a migration pte after calling
> arch_unmap_one(). This causes __handle_mm_fault() to be called when a reference to
> the va covered by the migration pte is made. This will in turn finally result in a
> call to arch_do_swap_page(), which restores the tags.

Migration PTEs are restored via mm/migrate.c:remove_migration_ptes().
arch_do_swap_page() won't be called.

What you describe is the case where someone accesses the migration PTE while
migration is still active and the migration PTEs have not been removed yet.
While we'll end up in do_swap_page(), we'll do a migration_entry_wait(),
followed by an effectively immediate "return 0;". arch_do_swap_page() won't
get called.


So in essence, I think this doesn't work as expected yet. In the best case we
don't immediately free memory. In the worst case we lose the tags.

-- 
Thanks,

David / dhildenb
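
P.S.: The early bail-out I mean is the one at the beginning of do_swap_page();
roughly (simplified and abbreviated here, so don't take the exact shape
literally):

	entry = pte_to_swp_entry(vmf->orig_pte);
	if (unlikely(non_swap_entry(entry))) {
		if (is_migration_entry(entry)) {
			/* Wait for migration to finish; the fault gets retried later. */
			migration_entry_wait(vma->vm_mm, vmf->pmd, vmf->address);
		}
		...
		goto out;	/* ret is still 0, arch_do_swap_page() is never reached */
	}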