On Tue, Mar 17, 2015 at 1:51 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On the -o ag_stride=-1 -o bhash=101073 config, the 60s perf stat I
> was using during steady state shows:
>
>      471,752      migrate:mm_migrate_pages ( +-  7.38% )
>
> The migrate pages rate is even higher than in 4.0-rc1 (~360,000)
> and 3.19 (~55,000), so that looks like even more of a problem than
> before.

Hmm. How stable are those numbers boot-to-boot?

That kind of extreme spread makes me suspicious. It's also interesting
that if the numbers really go up even more (and by that big amount),
then why does there seem to be almost no correlation with performance
(which apparently went up since rc1, despite migrate_pages getting
even _worse_).

> And the profile looks like:
>
> -   43.73%     0.05%  [kernel]            [k] native_flush_tlb_others

Ok, that's down from rc1 (67%), but still hugely up from 3.19 (13.7%).
And flush_tlb_page() does seem to be called about ten times more
(flush_tlb_mm_range used to be 1.4% of the callers, now it's invisible
at 0.13%)

Damn. From a performance number standpoint, it looked like we zoomed
in on the right thing. But now it's migrating even more pages than
before. Odd.

> And the vmstats are:
>
> 3.19:
>
> numa_hit 5163221
> numa_local 5153127
>
> 4.0-rc1:
>
> numa_hit 36952043
> numa_local 36927384
>
> 4.0-rc4:
>
> numa_hit 23447345
> numa_local 23438564
>
> Page migrations are still up by a factor of ~20 on 3.19.

The thing is, those "numa_hit" things come from the zone_statistics()
call in buffered_rmqueue(), which in turn is simply from the memory
allocator. That has *nothing* to do with virtual memory, and
everything to do with actual physical memory allocations. So the load
is simply allocating a lot more pages, presumably for those stupid
migration events.

But then it doesn't correlate with performance anyway..

Can you do a simple stupid test?
Apply that commit 53da3bc2ba9e ("mm: fix up numa read-only thread
grouping logic") to 3.19, so that it uses the same "pte_dirty()" logic
as 4.0-rc4. That *should* make the 3.19 and 4.0-rc4 numbers comparable.

It does make me wonder if your load is "chaotic" wrt scheduling. The
load presumably wants to spread out across all cpu's, but then the
numa code tries to group things together for numa accesses, but
depending on just random allocation patterns and layout in the hash
tables, there either are patterns with page access or there aren't.

Which is kind of why I wonder how stable those numbers are boot to
boot. Maybe this is at least partly about lucky allocation patterns.

                      Linus

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
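
For reference, the counters quoted in this message can be put side by
side as ratios against 3.19. This is only a rough sketch: the variable
and function names are made up for illustration, the 3.19 and 4.0-rc1
migrate rates are the approximate ("~") figures from the quoted text,
and nothing here is newly measured.

```python
# Counters copied from the quoted message. The 3.19 and 4.0-rc1 migrate
# rates are the approximate ("~") values given there, not exact counts.
migrate_rate = {"3.19": 55_000, "4.0-rc1": 360_000, "4.0-rc4": 471_752}
numa_hit = {"3.19": 5_163_221, "4.0-rc1": 36_952_043, "4.0-rc4": 23_447_345}


def ratio_to_baseline(counters, baseline="3.19"):
    """Return each counter divided by the value for the baseline kernel."""
    base = counters[baseline]
    return {kernel: value / base for kernel, value in counters.items()}


# Hypothetical names, used only for this illustration.
migrate_ratio = ratio_to_baseline(migrate_rate)
hit_ratio = ratio_to_baseline(numa_hit)

for kernel in migrate_rate:
    print(f"{kernel:8s} migrate x{migrate_ratio[kernel]:.2f} "
          f"numa_hit x{hit_ratio[kernel]:.2f}")
```

With these inputs the 4.0-rc4 migrate rate comes out at roughly 8.6
times the 3.19 figure, while numa_hit is only about 4.5 times higher,
so the two counters do not move in step between rc1 and rc4.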