On 26.08.22 16:32, Peter Xu wrote:
> On Fri, Aug 26, 2022 at 11:02:58AM +1000, Alistair Popple wrote:
>>
>> Peter Xu <peterx@xxxxxxxxxx> writes:
>>
>>> On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
>>>>
>>>> Peter Xu <peterx@xxxxxxxxxx> writes:
>>>>
>>>>> On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
>>>>>> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
>>>>>> installs migration entries directly if it can lock the migrating page.
>>>>>> When removing a dirty pte the dirty bit is supposed to be carried over
>>>>>> to the underlying page to prevent it being lost.
>>>>>>
>>>>>> Currently migrate_vma_*() can only be used for private anonymous
>>>>>> mappings. That means loss of the dirty bit usually doesn't result in
>>>>>> data loss because these pages are typically not file-backed. However
>>>>>> pages may be backed by swap storage which can result in data loss if an
>>>>>> attempt is made to migrate a dirty page that doesn't yet have the
>>>>>> PageDirty flag set.
>>>>>>
>>>>>> In this case migration will fail due to unexpected references but the
>>>>>> dirty pte bit will be lost. If the page is subsequently reclaimed data
>>>>>> won't be written back to swap storage as it is considered uptodate,
>>>>>> resulting in data loss if the page is subsequently accessed.
>>>>>>
>>>>>> Prevent this by copying the dirty bit to the page when removing the pte
>>>>>> to match what try_to_migrate_one() does.
>>>>>>
>>>>>> Signed-off-by: Alistair Popple <apopple@xxxxxxxxxx>
>>>>>> Acked-by: Peter Xu <peterx@xxxxxxxxxx>
>>>>>> Reported-by: Huang Ying <ying.huang@xxxxxxxxx>
>>>>>> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>>>>>> Cc: stable@xxxxxxxxxxxxxxx
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Changes for v3:
>>>>>>
>>>>>> - Defer TLB flushing
>>>>>> - Split a TLB flushing fix into a separate change.
>>>>>>
>>>>>> Changes for v2:
>>>>>>
>>>>>> - Fixed up Reported-by tag.
>>>>>> - Added Peter's Acked-by.
>>>>>> - Atomically read and clear the pte to prevent the dirty bit getting
>>>>>>   set after reading it.
>>>>>> - Added fixes tag
>>>>>> ---
>>>>>>  mm/migrate_device.c | 9 +++++++--
>>>>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>> index 6a5ef9f..51d9afa 100644
>>>>>> --- a/mm/migrate_device.c
>>>>>> +++ b/mm/migrate_device.c
>>>>>> @@ -7,6 +7,7 @@
>>>>>>  #include <linux/export.h>
>>>>>>  #include <linux/memremap.h>
>>>>>>  #include <linux/migrate.h>
>>>>>> +#include <linux/mm.h>
>>>>>>  #include <linux/mm_inline.h>
>>>>>>  #include <linux/mmu_notifier.h>
>>>>>>  #include <linux/oom.h>
>>>>>> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>>>>>>  			if (anon_exclusive) {
>>>>>>  				flush_cache_page(vma, addr, pte_pfn(*ptep));
>>>>>> -				ptep_clear_flush(vma, addr, ptep);
>>>>>> +				pte = ptep_clear_flush(vma, addr, ptep);
>>>>>>
>>>>>>  				if (page_try_share_anon_rmap(page)) {
>>>>>>  					set_pte_at(mm, addr, ptep, pte);
>>>>>> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  					goto next;
>>>>>>  				}
>>>>>>  			} else {
>>>>>> -				ptep_get_and_clear(mm, addr, ptep);
>>>>>> +				pte = ptep_get_and_clear(mm, addr, ptep);
>>>>>>  			}
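
[Editor's note: a minimal sketch of the dirty-bit carry-over the commit
message describes, inserted here for illustration. The exact helper used
to dirty the page is an assumption; the final patch may use a folio
variant rather than set_page_dirty().]

	/*
	 * Read and clear the pte atomically so a hardware dirty-bit
	 * update cannot slip in between reading the pte and clearing it.
	 */
	pte = ptep_get_and_clear(mm, addr, ptep);

	/*
	 * The pte is gone; transfer its dirty bit to the page so that a
	 * later reclaim still writes the data back to swap instead of
	 * treating the page as clean. set_page_dirty() here is a sketch,
	 * not necessarily the helper the final patch uses.
	 */
	if (pte_dirty(pte))
		set_page_dirty(page);
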
>>>>>
>>>>> I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
>>>>> moved above the condition check so they're called unconditionally. Could
>>>>> you explain the rationale on why it's changed back (since I think v2 was
>>>>> the correct approach)?
>>>>
>>>> Mainly because I agree with your original comments, that it would be
>>>> better to keep the batching of TLB flushing if possible. After the
>>>> discussion I don't think there are any issues with HW pte dirty bits
>>>> here. There are already other cases where HW needs to get that right
>>>> anyway (e.g. zap_pte_range).
>>>
>>> Yes, TLB batching was kept, thanks for doing it that way. Though if we only
>>> apply patch 1 we'll have both ptep_clear_flush() and the batched flush,
>>> which seems redundant.
>>>
>>>>
>>>>> The other question is if we want to split the patch, would it be better to
>>>>> move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
>>>>
>>>> Isn't that already the case? Patch 1 moves the TLB flush before the PTL
>>>> as suggested, patch 2 atomically copies the dirty bit without changing
>>>> any TLB flushing.
>>>
>>> IMHO it's cleaner to have patch 1 fix the batched flush, replace
>>> ptep_clear_flush() with ptep_get_and_clear() and update the pte properly.
>>
>> Which ptep_clear_flush() are you referring to? This one?
>>
>> 	if (anon_exclusive) {
>> 		flush_cache_page(vma, addr, pte_pfn(*ptep));
>> 		ptep_clear_flush(vma, addr, ptep);
>
> Correct.
>
>>
>> My understanding is that we need to do a flush for anon_exclusive.
>
> To me anon exclusive only shows that this mm exclusively owns this page. I
> didn't quickly figure out why that requires different handling of TLB
> flushes. Did I perhaps miss something?

GUP-fast is the magic bit, we have to make sure that we won't see new GUP
pins, thus the TLB flush. include/linux/mm.h:gup_must_unshare() contains
documentation.

Without GUP-fast, some things would be significantly easier to handle.

-- 
Thanks,

David / dhildenb
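
[Editor's note: a rough sketch of the GUP-fast ordering David points at,
assuming an architecture where the TLB flush in ptep_clear_flush() is
carried out via IPI; illustrative only, not kernel code.]

	/*
	 * CPU 0: migrate_vma_collect_pmd()    CPU 1: GUP-fast
	 * --------------------------------    ---------------
	 * pte = ptep_clear_flush(vma,         disables interrupts and walks
	 *                addr, ptep);         the page tables without taking
	 *   clears the pte and flushes        the PTL; may pin the page
	 *   the TLB; an IPI-based flush       through the old pte value
	 *   waits for CPU 1's irq-off
	 *   walk to finish
	 *
	 * page_try_share_anon_rmap(page);     a walk started after the clear
	 *   any pin taken through the old     sees pte_none() and falls back
	 *   pte is visible by now, so the     to the slow path
	 *   "no new GUP pins" check cannot
	 *   race with GUP-fast
	 */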