Re: Race condition observed between page migration and page fault handling on arm64 machines

Dev Jain <dev.jain@xxxxxxx> · Wed, 7 Aug 2024 18:28:57 +0530

On 8/7/24 17:09, David Hildenbrand wrote:
On 05.08.24 16:14, Dev Jain wrote:

On 8/5/24 16:16, David Hildenbrand wrote:
On 05.08.24 11:51, Dev Jain wrote:

On 8/1/24 19:18, David Hildenbrand wrote:
On 01.08.24 15:43, Will Deacon wrote:
On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote:
On 01.08.24 15:13, David Hildenbrand wrote:
To dampen the tradeoff, we could do this in shmem_fault()
instead? But
then, this would mean that we do this in all

kinds of vma->vm_ops->fault, only when we discover another
reference
count race condition :) Doing this in do_fault()

should solve this once and for all. In fact, do_pte_missing()
may call
do_anonymous_page() or do_fault(), and I just

noticed that the former already checks this using
vmf_pte_changed().

What I am still missing is why this is (a) arm64 only; and 
(b) if
this
is something we should really worry about. There are other 
reasons
(e.g., speculative references) why migration could temporarily
fail,
does it happen that often that it is really something we have to
worry
about?


(a) See discussion at [1]; I guess it passes on x86, which is 
quite
strange since the race is clearly arch-independent.

Yes, I think this is what we have to understand. Is the race 
simply
less
likely to trigger on x86?

I would assume that it would trigger on any arch.

I just ran it on a x86 VM with 2 NUMA nodes and it also seems to
work here.

Is this maybe related to deferred flushing? Such that the other 
CPU
will
by accident just observe the !pte_none a little less likely?

But arm64 also usually defers flushes, right? At least unless
ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do
deferred
flushes.

Bingo!

diff --git a/mm/rmap.c b/mm/rmap.c
index e51ed44f8b53..ce94b810586b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct
mm_struct
*mm, pte_t pteval,
     */
    static bool should_defer_flush(struct mm_struct *mm, enum
ttu_flags flags)
    {
-       if (!(flags & TTU_BATCH_FLUSH))
-               return false;
-
-       return arch_tlbbatch_should_defer(mm);
+       return false;
    }


On x86:

# ./migration
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
#  RUN           migration.shared_anon ...
Didn't migrate 1 pages
# migration.c:170:shared_anon:Expected migrate(ptr, self->n1,
self->n2) (-2)
== 0 (0)
# shared_anon: Test terminated by assertion
#          FAIL  migration.shared_anon
not ok 1 migration.shared_anon


It fails all of the time!

Nice work! I suppose that makes sense as, with the eager TLB
invalidation, the window between the other CPU faulting and the
migration entry being written is fairly wide.

Not sure about a fix though :/ It feels a bit overkill to add a new
invalid pte encoding just for this.

Something like that might make the test happy in most cases:

diff --git a/tools/testing/selftests/mm/migration.c
b/tools/testing/selftests/mm/migration.c
index 6908569ef406..4c18bfc13b94 100644
--- a/tools/testing/selftests/mm/migration.c
+++ b/tools/testing/selftests/mm/migration.c
@@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
          int ret, tmp;
          int status = 0;
          struct timespec ts1, ts2;
+       int errors = 0;

          if (clock_gettime(CLOCK_MONOTONIC, &ts1))
                  return -1;
@@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2)
                  ret = move_pages(0, 1, (void **) &ptr, &n2, 
&status,
                                  MPOL_MF_MOVE_ALL);
                  if (ret) {
-                       if (ret > 0)
+                       if (ret > 0) {
+                               if (++errors < 100)
+                                       continue;
                                  printf("Didn't migrate %d pages\n",
ret);
-                       else
+                       } else {
                                  perror("Couldn't migrate pages");
+                       }
                          return -2;
                  }
+               /* Progress! */
+               errors = 0;

                  tmp = n2;
                  n2 = n1;


[root@localhost mm]# ./migration
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
#  RUN           migration.shared_anon ...
#            OK  migration.shared_anon
ok 1 migration.shared_anon
# PASSED: 1 / 1 tests passed.
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0


This does make the test pass, to my surprise, since what you are doing
from userspace

should have been done by the kernel, because it retries folio 
unmapping
and moving

NR_MAX_MIGRATE_(A)SYNC_RETRY times; I had already tested pumping up
these

macros and the original test was still failing. Now, I digged in more,
and, if the

following assertion is correct:


Any thread having a reference on a folio will end up calling
folio_lock()


Good point. I suspect concurrent things like read/write would also be
able to trigger this (did not check, though).


then it seems to me that the retry for loop wrapped around
migrate_folio_move(), inside

migrate_pages_batch(), is useless; if migrate_folio_move() fails on 
the
first iteration, it is

going to fail for all iterations since, if I am reading the code path
correctly, the only way it

fails is when the actual refcount is not equal to expected refcount 
(in
folio_migrate_mapping()),

and there is no way that the extra refcount is going to get released
since the migration path

has the folio lock.

And therefore, this begs the question: isn't it logical to assert the
actual refcount against the

expected refcount, just after we have changed the PTEs, so that if 
this
assertion fails, we can

go to the next iteration of the for loop for migrate_folio_unmap()
inside migrate_pages_batch()

by calling migrate_folio_undo_src()/dst() to restore the old state? 
I am
trying to implement

this but is not as straightforward as it seemed to me this morning.

I agree with your assessment that migration code currently doesn't
handle the case well when some other thread does an unconditional
folio_lock(). folio_trylock() users would be handled, but that's not
what we want with FGP_LOCK I assume.

So IIUC, your idea would be to unlock the folio in migration code and
try again their. Sounds reasonable, without looking into the details :)



BTW, I was trying to find the spot that would do the folio_lock(), but 
filemap_fault() does the lock_folio_maybe_drop_mmap() where we do a 
folio_trylock().

Where exactly is the folio_lock() on the fault path that would 
prohibit us from making progress?

Not filemap_fault(); it enters shmem_fault() which eventually calls 
shmem_get_folio_gfp(), retrieving the folio from the pagecache, and 
calling folio_lock().


(Adding Mel if at all he has any comments for a compaction use-case)

What I was trying to say is this (forgive me for the hard-coded value):

The hard-coded 2 is wrong indeed :)


diff --git a/mm/migrate.c b/mm/migrate.c
index a8c6f466e33a..404af46dd661 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1262,6 +1262,8 @@ static int migrate_folio_unmap(new_folio_t
get_new_folio,
}
if (!folio_mapped(src)) {
+ if (folio_ref_count(src) != 2)
+ goto out;
__migrate_folio_record(dst, old_page_state, anon_vma);
return MIGRATEPAGE_UNMAP;
}
This will give us more chances to win the race. On an average, now
the test fails on 100 iterations of move_pages(). If you multiply
the NR_MAX_PAGES_(A)SYNC_RETRY macros by 3, the average goes above
to 2000.
But if the consensus is that this is just pleasing the test without
any real use-case (compaction?) then I guess I am alright with making
the change in the test.

I'd be curious if other activity (besides fault, like concurrent 
read()/write()/...) can similarly make migration fail. I suspect it can.