Re: Race condition observed between page migration and page fault handling on arm64 machines

To dampen the tradeoff, we could do this in shmem_fault() instead? But
then, this would mean that we do this in all kinds of
vma->vm_ops->fault, only when we discover another reference count race
condition :) Doing this in do_fault() should solve this once and for
all. In fact, do_pte_missing() may call do_anonymous_page() or
do_fault(), and I just noticed that the former already checks this
using vmf_pte_changed().
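
For illustration only, the re-check idea can be modeled in userspace
roughly as below. This is a sketch, not kernel code: model_pte_t,
struct model_fault, model_pte_changed() and model_do_fault() are
made-up stand-ins, and the page table lock that the real check would be
taken under is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel's types. */
typedef uint64_t model_pte_t;

struct model_fault {
	model_pte_t *pte;      /* live page table slot */
	model_pte_t orig_pte;  /* value sampled when the fault was first looked at */
};

enum model_result { MODEL_HANDLE_FAULT, MODEL_RETRY };

/*
 * Models the idea behind vmf_pte_changed(): report whether the slot no
 * longer holds the value that was sampled earlier. Locking is omitted
 * in this model (assumption).
 */
static bool model_pte_changed(const struct model_fault *vmf)
{
	assert(vmf && vmf->pte);
	return *vmf->pte != vmf->orig_pte;
}

/*
 * Models a fault path that bails out early when someone else (for
 * example migration) modified the entry in the meantime, instead of
 * going on to take further references on the page.
 */
static enum model_result model_do_fault(const struct model_fault *vmf)
{
	if (model_pte_changed(vmf))
		return MODEL_RETRY;
	return MODEL_HANDLE_FAULT;
}
```

With a slot that still holds the sampled value, model_do_fault()
proceeds; once the slot is rewritten behind its back, it asks for a
retry.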

What I am still missing is why this is (a) arm64 only; and (b) if this
is something we should really worry about. There are other reasons
(e.g., speculative references) why migration could temporarily fail,
does it happen that often that it is really something we have to worry
about?


(a) See discussion at [1]; I guess it passes on x86, which is quite
strange since the race is clearly arch-independent.

Yes, I think this is what we have to understand. Is the race simply less likely to trigger on x86?

I would assume that it would trigger on any arch.

I just ran it on an x86 VM with 2 NUMA nodes and it also seems to work here.

Is this maybe related to deferred flushing, such that the other CPU is just a little less likely to observe the !pte_none by accident?

But arm64 also usually defers flushes, right? At least unless ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred flushes.


(b) On my machine, move_pages() fails on average in under 10
iterations, which seems problematic to

Yes, it's a big difference compared to what I encounter.

--
Cheers,

David / dhildenb




