On Thu, Mar 10, 2016 at 05:10:35PM +0100, Andrea Arcangeli wrote:
> On Thu, Feb 25, 2016 at 10:50:14PM -0800, Hugh Dickins wrote:
> > It's a useful suggestion from Gerald, and your THP rework may have
> > brought us closer to being able to rely on RCU locking rather than
> > IRQ disablement there; but you were right just to delete the comment,
> > there are other reasons why fast GUP still depends on IRQs disabled.
> >
> > For example, see the fallback tlb_remove_table_one() in mm/memory.c:
> > that one uses smp_call_function(), sending an IPI to all CPUs
> > concerned, without waiting an RCU grace period at all.
>
> I fully agree, the refcounting change just drops the THP splitting
> from the equation, but everything else remains. It's not like x86 is
> using RCU for gup_fast when CONFIG_TRANSPARENT_HUGEPAGE=n.
>
> The main issue Peter also pointed out is how it can be faster to wait
> an RCU grace period than to send an IPI to only the CPUs that have an
> active_mm matching the one the page belongs to

Typically RCU (sched) grace periods take a relative 'forever' compared
to sending IPIs. That is, synchronize_sched() is typically slower.

But, on the upside, not sending IPIs will not perturb those other CPUs,
which is something HPC/RT people like.

> and I'm not exactly
> sure the cost of disabling irqs in gup_fast is going to pay off.

It entirely depends on the workload of course, but you can do a lot of
gup_fast calls compared to munmap()s. So making gup_fast faster seems
like a potential win.

Also, is anybody really interested in munmap() performance?
> It's
> not just swap, large munmap should be able to free up pagetables, or
> pagetables would get a footprint out of proportion with the RSS of the
> process, and in turn it'll have to either block synchronously for long
> before returning to userland, or return to userland while the
> pagetable memory is still not freed, and userland may mmap again and
> munmap again in a loop, being legit in doing so too, with unclear side
> effects with regard to false positive OOM.

I'm not seeing that; the only point where this matters at all is if the
batch alloc fails. Otherwise the RCU_TABLE_FREE code uses
call_rcu_sched() and what you write above is already true.

Now, RCU already has an oom_notifier to push work harder if we approach
that.

> Then there's another issue with synchronize_sched():
> __get_user_pages_fast has to be safe to run from IRQ context (note the
> local_irq_save() instead of local_irq_disable()) and KVM leverages it.

This is unchanged. synchronize_sched() serializes against anything that
disables preemption, and having IRQs disabled is very much included in
that. So there should be no problem running this from IRQ context.

> KVM
> just requires it to be atomic so it can run from inside a preempt
> disabled section (i.e. inside a spinlock). I'm fairly certain the
> irq-safe guarantee could be dropped without pain and
> rcu_read_lock_sched() would be enough, but the documentation of the
> IRQ-safe guarantees provided by __get_user_pages_fast should also be
> altered if we were to use synchronize_sched(), and that's a symbol
> exported to GPL modules too.

No changes needed.

> Overall my main concern in switching x86 to RCU gup-fast is the
> performance of synchronize_sched() in large munmap pagetable teardown.

Normally, as already established by Martin, you should not actually
ever encounter the sync_sched() call. Only under severe memory
pressure, when the batch alloc in tlb_remove_table() fails, is this
ever an issue.
And at the point where such allocations fail, performance typically
isn't a concern anymore.