Hi Andrew,

thanks for your review. I have tried to answer your questions below.

On Wed, Sep 10, 2014 at 03:01:25PM -0700, Andrew Morton wrote:
> On Tue, 9 Sep 2014 17:43:51 +0200 Joerg Roedel <joro@xxxxxxxxxx> wrote:
> > So both call-backs can't be used to safely flush any non-CPU
> > TLB because _start() is called too early and _end() too
> > late.
>
> There's a lot of missing information here. Why don't the existing
> callbacks suit non-CPU TLBs? What is different about them? Please
> update the changelog to contain all this context.

The existing call-backs are called too early or too late. Specifically,
invalidate_range_start() is called when all pages are still mapped and
invalidate_range_end() when all pages are unmapped and potentially
freed.

This is fine when the users of the mmu_notifiers manage their own
SoftTLB, like KVM does. When the TLB is managed in software it is easy
to wipe out entries for a given range and prevent new entries from
being established until invalidate_range_end() is called.

But when the user of mmu_notifiers has to manage a hardware TLB it can
still wipe out TLB entries in invalidate_range_start(), but it can't
make sure that no new TLB entries in the given range are established
between invalidate_range_start() and invalidate_range_end().

[ Actually the current AMD IOMMUv2 code tries to do that by setting an
  empty page-table for the non-CPU TLB, but this causes address
  translation errors which end up in device failures. ]

But to avoid silent data corruption the TLB entries need to be flushed
out of the non-CPU hardware TLB when the pages are unmapped (at this
point in time no _new_ TLB entries can be established in the non-CPU
TLB) but not yet freed (as the non-CPU TLB may still have _existing_
entries pointing to the pages about to be freed).

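To make the ordering argument above concrete, here is a toy userspace
model. It is only a sketch, not kernel code and not part of the
patch-set: struct toy_mm, enum flush_point, device_touch() and
toy_unmap_sees_stale_entry() are names made up for this illustration,
and the model assumes a single page and a single device TLB entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: one page and one non-CPU (device) TLB entry. */
enum page_state { PAGE_MAPPED, PAGE_UNMAPPED, PAGE_FREED };

/* Hypothetical points at which the device TLB could be flushed. */
enum flush_point { FLUSH_AT_START, FLUSH_AT_RANGE, FLUSH_AT_END };

struct toy_mm {
	enum page_state page;	/* state of the page in the CPU page-table */
	bool dev_tlb_valid;	/* device TLB still caches a translation */
};

/* A hardware walker refills the device TLB while the page is mapped. */
static void device_touch(struct toy_mm *mm)
{
	if (mm->page == PAGE_MAPPED)
		mm->dev_tlb_valid = true;
}

/* A cached translation to a freed page means silent data corruption. */
static bool device_access_is_stale(const struct toy_mm *mm)
{
	return mm->dev_tlb_valid && mm->page == PAGE_FREED;
}

/*
 * Walk through one unmap sequence and report whether the device could
 * still reach the page after it has been freed.
 */
bool toy_unmap_sees_stale_entry(enum flush_point when)
{
	struct toy_mm mm = { .page = PAGE_MAPPED, .dev_tlb_valid = true };
	bool stale;

	assert(when == FLUSH_AT_START || when == FLUSH_AT_RANGE ||
	       when == FLUSH_AT_END);

	/* Point of invalidate_range_start(): pages are still mapped. */
	if (when == FLUSH_AT_START)
		mm.dev_tlb_valid = false;
	device_touch(&mm);		/* hardware can re-establish the entry */

	mm.page = PAGE_UNMAPPED;	/* page-table entries are cleared */

	/* Point of the remote TLB flush: unmapped but not yet freed. */
	if (when == FLUSH_AT_RANGE)
		mm.dev_tlb_valid = false;
	device_touch(&mm);		/* no refill: nothing is mapped anymore */

	mm.page = PAGE_FREED;		/* page goes back to the allocator */
	stale = device_access_is_stale(&mm);

	/* Point of invalidate_range_end(): too late, page may be freed. */
	if (when == FLUSH_AT_END)
		mm.dev_tlb_valid = false;

	return stale;
}
```

In this model only FLUSH_AT_RANGE returns false. Flushing at the
_start() point lets the hardware refill the entry while the page is
still mapped, and flushing at the _end() point happens after the page
has already been freed.
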
So to fix this problem we need to catch the moment when the Linux VMM
flushes remote TLBs (as a non-CPU TLB is not very different in its
flushing requirements from any other remote CPU TLB), as this is the
point in time when the pages are unmapped but _not_ yet freed.

The mmu_notifier_invalidate_range() function aims to catch that moment.

> > In the AMD IOMMUv2 driver this is currently implemented by
> > assigning an empty page-table to the external device between
> > _start() and _end(). But as tests have shown this doesn't
> > work as external devices don't re-fault infinitely but enter
> > a failure state after some time.
>
> More missing info. Why are these faults occurring? Is there some
> device activity which is trying to fault in pages, but the CPU is
> executing code between _start() and _end() so the driver must refuse to
> instantiate a page to satisfy the fault? That's just a guess, and I
> shouldn't be guessing. Please update the changelog to fully describe
> the dynamic activity which is causing this.

The device (usually a GPU) runs some process (for example a compute
job) that directly accesses a Linux process address space. Any access
to a process address space can cause a page-fault, whether the access
comes from the CPU or a remote device.

When the page-fault comes from a compute job running on a GPU, it is
reported to Linux by an IOMMU interrupt.

The current implementation of invalidate_range_start/end assigns an
empty page-table, which causes many page-faults from the GPU process,
resulting in an interrupt storm for the IOMMU.

The fault handler doesn't handle the fault if an
invalidate_range_start/end pair is active, but just reports back
SUCCESS to the device to let it refault the page later (refaulting is
the same strategy KVM implements).
But existing GPUs that make use of this feature don't refault
indefinitely. After a certain number of faults for the same address the
device enters a failure state and needs to be reset.

> > Next problem with this solution is that it causes an
> > interrupt storm for IO page faults to be handled when an
> > empty page-table is assigned.
>
> Also too skimpy. I *think* this is a variant of the problem in the
> preceding paragraph. We get a fault storm (which is problem 2) and
> sometimes the faulting device gives up (which is problem 1).
>
> Or something. Please de-fog all of this.

Right, I will update the description to be more clear.

> > Furthermore the _start()/end() notifiers only catch the
> > moment when page mappings are released, but not page-table
> > pages. But this is necessary for managing external TLBs when
> > the page-table is shared with the CPU.
>
> How come?

As mmu_notifiers are not used for managing TLBs that share the same
page-table as the CPU uses, there was no need to catch the page-table
freeing events, so a notifier for them is not available yet.

> > Any comments and review appreciated!
>
> The patchset looks decent, although I find it hard to review because I
> just wasn't provided with enough of the thinking that went into it. I
> have enough info to look at the C code, but not enough info to identify
> and evaluate alternative implementation approaches, to identify
> possible future extensions, etc.

Fair enough, I hope I clarified a few things with my explanations
above. I will also update the description of the patch-set when I
re-send.

> The patchset does appear to add significant additional overhead to hot
> code paths when mm_has_notifiers(mm). Please let's update the
> changelog to address this rather important concern.
> How significant is the impact on such mm's, how common are such mm's
> now and in the future, should we (for example) look at short-circuiting
> __mmu_notifier_invalidate_range() if none of the registered notifiers
> implement ->invalidate_range(), etc.

I think it adds the most overhead to single-CPU kernels. The
invalidate_range notifier is called in the paths that also do a remote
TLB flush, which is very expensive on its own. To those paths it adds
just another remote TLB that needs to be flushed.

Regards,

	Joerg