On Tue, Feb 23, 2016 at 09:58:59PM +0000, Mel Gorman wrote:
> On Tue, Feb 23, 2016 at 12:59:15PM -0800, Johannes Weiner wrote:
> > The problem is that kswapd will stay awake and continuously draw
> > subsequent allocations into a single zone, thus utilizing only a
> > fraction of available memory.
>
> Not quite. Look at prepare_kswapd_sleep() in the full series and it has this
>
>         for (i = 0; i <= classzone_idx; i++) {
>                 struct zone *zone = pgdat->node_zones + i;
>
>                 if (!populated_zone(zone))
>                         continue;
>
>                 if (zone_balanced(zone, order, 0, classzone_idx))
>                         return true;
>         }
>
> and balance_pgdat has this
>
>         /* Only reclaim if there are no eligible zones */
>         for (i = classzone_idx; i >= 0; i--) {
>                 zone = pgdat->node_zones + i;
>                 if (!populated_zone(zone))
>                         continue;
>
>                 if (!zone_balanced(zone, order, 0, classzone_idx)) {
>                         classzone_idx = i;
>                         break;
>                 }
>         }
>
> kswapd only stays awake until *one* balanced zone is available. That is
> a key difference with the existing kswapd which balances all zones.

Thanks for clarifying, that is a good point. I applied the full series
now locally and the final code is indeed much easier to understand.

> > Sure, it doesn't matter in that benchmark, because the pages are used
> > only once. But if it had an actual cache workingset bigger than DMA32
> > but smaller than DMA32+Normal, it would be thrashing unnecessarily.
> >
> > If kswapd were truly balancing the pages in a node equally, regardless
> > of zone placement, then in the long run we should see zone allocations
> > converge to a share that is in proportion to each zone's size. As far
> > as I can see, that is not quite happening yet.
> >
>
> Not quite either. The order kswapd reclaims is in related to the age of
> all pages in the node. Early in the lifetime of the system, that may be
> ZONE_NORMAL initially until the other zones are populated. Ultimately
> the balance of zones will be related to the age of the pages.

Thanks again.
Yes, the picture is finally clicking into place for me.

> > > > If reclaim can't guarantee a balanced zone utilization then the
> > > > allocator has to keep doing it. :(
> > >
> > > That's the key issue - the main reason balanced zone utilisation is
> > > necessary is because we reclaim on a per-zone basis and we must avoid
> > > page aging anomalies. If we balance such that one eligible zone is above
> > > the watermark then it's less of a concern.
> >
> > Yes, but only if there can't be extended reclaim stretches that prefer
> > the pages of a single zone. Yet it looks like this is still possible.
>
> And that is a problem if a workload is dominated by allocations
> requiring the lower zones. If that is the common case then it's a bust
> and fair zone allocation policy is still required. That removes one
> motivation from the series as it leaves some fatness in the page
> allocator paths.

With your above explanations, I'm now much more confident this series is
doing the right thing. Thanks.

The uncertainty over low-zone allocation floods is real, but what is
also unsettling is that, where the fair zone code used to shield us from
kswapd changes, we now open ourselves up to subtle aging bugs, which are
no longer detectable via the zone placement statistics. And we have
changed kswapd around quite extensively in the recent past.

A good metric for aging distortion might be able to mitigate both these
things. Something to keep an eye on when making changes to kswapd, or
when analyzing performance problems with a workload.

What I have in mind is per-classzone counters of reclaim work. If we had
exact numbers on how much zone-restricted reclaim is being done relative
to unrestricted scans, we could know how severely the aging process is
being distorted under any given workload. That would allow us to
validate these changes here, future kswapd and allocator changes, and
help us identify problematic workloads.
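
To make the counter idea a bit more concrete, here is a rough sketch in
plain userspace C, not kernel code. Every name in it (struct
reclaim_stats, count_reclaim(), restricted_pct(), the CZ_* indices) is
hypothetical and exists nowhere in the series. It only models the
bookkeeping, under the assumption that scan work is charged to the
classzone of the request that triggered it and that anything below the
highest populated zone counts as restricted:

```c
#include <assert.h>

/* Hypothetical zone indices, ordered low to high like the kernel's zones. */
enum classzone { CZ_DMA, CZ_DMA32, CZ_NORMAL, CZ_COUNT };

/* Hypothetical counters: pages scanned by reclaim, per requesting classzone. */
struct reclaim_stats {
	unsigned long scanned[CZ_COUNT];
};

/* Charge nr_scanned pages of reclaim work to the requesting classzone. */
void count_reclaim(struct reclaim_stats *rs, enum classzone classzone_idx,
		   unsigned long nr_scanned)
{
	assert(classzone_idx < CZ_COUNT);
	rs->scanned[classzone_idx] += nr_scanned;
}

/*
 * Percentage of all scan work that was zone-restricted. Requests for the
 * highest classzone may use every zone, so their scans are unrestricted;
 * everything charged to a lower classzone is treated as restricted.
 */
unsigned int restricted_pct(const struct reclaim_stats *rs,
			    enum classzone highest)
{
	unsigned long restricted = 0, total = 0;
	int i;

	for (i = 0; i < CZ_COUNT; i++) {
		total += rs->scanned[i];
		if (i < (int)highest)
			restricted += rs->scanned[i];
	}
	return total ? (unsigned int)(restricted * 100 / total) : 0;
}

/* Made-up numbers: 9000 pages scanned for Normal requests, 1000 for DMA32. */
unsigned int example_restricted_pct(void)
{
	struct reclaim_stats rs = { { 0 } };

	count_reclaim(&rs, CZ_NORMAL, 9000);
	count_reclaim(&rs, CZ_DMA32, 1000);
	return restricted_pct(&rs, CZ_NORMAL);
}
```

A persistently high percentage under a workload that mostly issues
Normal requests would be the kind of signal I am after. Whether the
counters should track pages scanned, pages reclaimed, or both is left
open here.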

And maybe we can change the now useless pgalloc_ stats from counting
zone placement to counting allocation requests by classzone. We could
then again correlate the number of requests to the amount of work done.
A high amount of restricted reclaim on behalf of mostly Normal
allocation requests would detect the bug I described above, e.g. And we
could generally tell how expensive restricted allocations are in the new
node-LRUs.

What do you think?