Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote: > On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote: >> Therefore, IIUC, try_to_umap_one() should only call >> mmu_notifier_invalidate_range() after ptep_get_and_clear() and > > That would trigger an unnecessarily double call to > ->invalidate_range() both from mmu_notifier_invalidate_range() after > ptep_get_and_clear() and later at the end in > mmu_notifier_invalidate_range_end(). > > The only advantage of adding a mmu_notifier_invalidate_range() after > ptep_get_and_clear() is to flush the secondary MMU TLBs (keep in mind > the pagetables have to be shared with the primary MMU in order to use > the ->invalidate_range()) inside the PT lock. > > So if mmu_notifier_invalidate_range() after ptep_get_and_clear() is > needed or not, again boils down to the issue if the old code calling > ->invalidate_page outside the PT lock was always broken before or > not. I don't see why exactly it was broken, we even hold the page lock > there so I don't see a migration race possible either. Of course the > constraint to be safe is that the put_page in try_to_unmap_one cannot > be the last one, and that had to be enforced by the caller taking an > additional reference on it. > > One can wonder if the primary MMU TLB flush in ptep_clear_flush > (should_defer_flush returning false) could be put after releasing the > PT lock too (because that's not different from deferring the secondary > MMU notifier TLB flush in ->invalidate_range down to > mmu_notifier_invalidate_range_end) even if TTU_BATCH_FLUSH isn't set, > which may be safe too for the same reasons. > > When should_defer_flush returns true we already defer the primary MMU > TLB flush to much later to even mmu_notifier_invalidate_range_end, not > just after the PT lock release so at least when should_defer_flush is > true, it looks obviously safe to defer the secondary MMU TLB flush to > mmu_notifier_invalidate_range_end for the drivers implementing > ->invalidate_range. Agreed. It just seems as an enhancement and not directly a bug fix. > > If I'm wrong and all TLB flushes must happen inside the PT lock, then > we should at least reconsider the implicit call to ->invalidate_range > method from mmu_notifier_invalidate_range_end or we would call it > twice unnecessarily which doesn't look optimal. Either ways this > doesn't look optimal. We would need to introduce a > mmu_notifier_invalidate_range_end_full that calls also > ->invalidate_range in such case so we skip the ->invalidate_range call > in mmu_notifier_invalidate_range_end if we put an explicit > mmu_notifier_invalidate_range() after ptep_get_and_clear inside the PT > lock like you suggested above. It is not trivial to flush TLBs (primary or secondary) without holding the page-table lock, and as we recently encountered this resulted in several bugs [1]. The main problem is that even if you perform the TLB flush immediately after the PT-lock is released, you cause a situation in which other threads may make decisions based on the updated PTE value, without being aware that a TLB flush is needed. For example, we recently encountered a Linux bug when two threads run MADV_DONTNEED concurrently on the same address range [2]. One of the threads may update a PTE, setting it as non-present, and then deferring the TLB flush (for batching). As a result, however, it would cause the second thread, which also changes the PTEs to assume that the PTE is already non-present and TLB flush is not necessary. As a result the second core may still hold stale PTEs in its TLB following MADV_DONTNEED. There is a swarm of such problems, and some are not too trivial. Deferring TLB flushes needs to be done in a very careful manner. [1] https://marc.info/?t=149973438800002&r=1 [2] http://www.spinics.net/lists/linux-mm/msg131525.html -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>