On Wed, May 02, 2012 at 08:23:09AM +0200, Johannes Weiner wrote:
> On Wed, May 02, 2012 at 03:57:41AM +0200, Andrea Arcangeli wrote:
> > On Tue, May 01, 2012 at 10:41:53AM +0200, Johannes Weiner wrote:
> > > frequently used active page.  Instead, for each refault with a
> > > distance smaller than the size of the active list, we deactivate an
> >
> > Shouldn't this be the size of active list + size of inactive list?
> >
> > If the active list is 500M, inactive 500M and the new working set is
> > 600M, the refault distance will be 600M, it won't be smaller than the
> > size of the active list, and it won't deactivate the active list as it
> > should and it won't be detected as working set.
> >
> > Only the refault distance bigger than inactive+active should not
> > deactivate the active list if I understand how this works correctly.
>
> The refault distance is what's missing, not the full reuse frequency.
> You ignore the 500M worth of inactive LRU time the page had in memory.
> The distance in that scenario would be 100M, the time between eviction
> and refault:
>
>         +-----------------------------++-----------------------------+
>         |                             ||                             |
>         |          inactive           ||           active            |
>         +-----------------------------++-----------------------------+
> +~~~~~~~------------------------------+
> |                                     |
> |               new set               |
> +~~~~~~~------------------------------+
> ^       ^
> |       |
> |       eviction
> refault
>
> The ~~~'d part could fit into memory if the active list was 100M
> smaller.

Never mind, I see that the refault distance is only going to measure
the amount of the new working set that spilled over the inactive list,
so it would only be set to 100M in the example.
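To double check the arithmetic of the example, here is a tiny userspace
model. It is only a sketch and not code from the patch: the function
names are made up, sizes are in MB, and it assumes the new set is read
sequentially and held only on the inactive list.

```c
#include <stdbool.h>

/*
 * Toy model, sizes in MB. Assumption: whatever part of the new set does
 * not fit on the inactive list has been evicted by the time it is
 * touched again, so the spill is the refault distance.
 */
static unsigned long model_refault_distance(unsigned long new_set,
					    unsigned long inactive)
{
	return new_set > inactive ? new_set - inactive : 0;
}

/*
 * Rule quoted above: deactivate an active page when the refault
 * distance is smaller than the size of the active list. A distance of
 * zero is treated here as "no refault", which is an assumption of this
 * model.
 */
static bool model_should_deactivate(unsigned long distance,
				    unsigned long active)
{
	return distance != 0 && distance < active;
}
```

With inactive = 500, active = 500 and a new set of 600, the model gives
a distance of 100, which is smaller than the active list, so the
deactivation would trigger as intended.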
> > > @@ -1726,6 +1728,11 @@ zonelist_scan:
> > >  		if ((alloc_flags & ALLOC_CPUSET) &&
> > >  			!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > >  				continue;
> > > +		if ((alloc_flags & ALLOC_WMARK_LOW) &&
> > > +		    current->refault_distance &&
> > > +		    !workingset_zone_alloc(zone, current->refault_distance,
> > > +					   &distance, &active))
> > > +			continue;
> > >  		/*
> > >  		 * When allocating a page cache page for writing, we
> > >  		 * want to get it from a zone that is within its dirty
> >
> > It's a bit hard to see how this may not run oom prematurely if the
> > distance is always bigger, this is just an implementation question and
> > maybe I'm missing a fallback somewhere where we actually allocate
> > memory from whatever place in case no place is ideal.
>
> Sorry, this should be documented better.
>
> The ALLOC_WMARK_LOW check makes sure this only applies in the
> fastpath.  It will prepare reclaim with lruvec->shrink_active, then
> wake up kswapd and retry the zonelist without this constraint.

My point is that this is going to change the semantics of
ALLOC_WMARK_LOW to "return OOM randomly even if there's plenty of free
memory" instead of "use only up to the low wmark". I see you want to
wake kswapd and retry with the min wmark after that, but maybe it
would be cleaner to have a new ALLOC_REFAULT_DISTANCE flag to avoid
altering the meaning of ALLOC_WMARK_LOW.

Then add a "|ALLOC_REFAULT_DISTANCE" to the parameter. It sounds
simpler to keep controlling the wmark level checked with
ALLOC_WMARK_LOW|MIN|HIGH without introducing a new special meaning
for the LOW bitflag.

This is only a cleanup though, I believe it works fine at runtime.

> > > +	/*
> > > +	 * Lower zones may not even be full, and free pages are
> > > +	 * potential inactive space, too.  But the dirty reserve is
> > > +	 * not available to page cache due to lowmem reserves and the
> > > +	 * kswapd watermark.  Don't include it.
> > > +	 */
> > > +	zone_free = zone_page_state(zone, NR_FREE_PAGES);
> > > +	if (zone_free > zone->dirty_balance_reserve)
> > > +		zone_free -= zone->dirty_balance_reserve;
> > > +	else
> > > +		zone_free = 0;
> >
> > Maybe also remove the high wmark from the sum? It can be some hundred
> > meg so it's better to take it into account, to have a more accurate
> > math and locate the best zone that surely fits.
> >
> > For the same reason it looks like the lowmem reserve should also be
> > taken into account, on the full sum.
>
> dirty_balance_reserve IS the sum of the high watermark and the biggest
> lowmem reserve for a particular zone, see how it's calculated in
> mm/page_alloc.c::calculate_totalreserve_pages().
>
> nr_free - dirty_balance_reserve is the number of pages available to
> page cache allocations without keeping kswapd alive or having to dip
> into lowmem reserves.
>
> Or did I misunderstand you?

No, that's all right then! I didn't realize dirty_balance_reserve
accounts exactly for what I wrote above (high wmark and lowmem
reserve).

I've seen it used by page-writeback and I naively assumed it had to do
with dirty page levels, while it has absolutely nothing to do with
writeback or any dirty memory level! Despite its quite misleading
dirty_ prefix :)
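For the record, the accounting described above can be written down as a
small userspace model. This is a sketch based only on the description
in this thread (high watermark plus the biggest lowmem reserve), not a
copy of calculate_totalreserve_pages(); the function and parameter
names are hypothetical and all values are in pages.

```c
#include <stddef.h>

/*
 * Model of the reserve as described in the thread: the zone's high
 * watermark plus the largest of its lowmem reserves. Any further
 * adjustment the kernel may apply is deliberately left out.
 */
static unsigned long model_dirty_balance_reserve(unsigned long high_wmark,
						 const unsigned long *lowmem_reserve,
						 size_t nr)
{
	unsigned long max = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		if (lowmem_reserve[i] > max)
			max = lowmem_reserve[i];

	return high_wmark + max;
}

/*
 * Free pages usable by page cache without keeping kswapd alive or
 * dipping into lowmem reserves; clamps at zero like the quoted hunk.
 */
static unsigned long model_cache_free(unsigned long nr_free,
				      unsigned long reserve)
{
	return nr_free > reserve ? nr_free - reserve : 0;
}
```

The clamp to zero mirrors the quoted hunk: a zone whose free pages are
all within the reserve contributes no potential inactive space.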