On Tue, Sep 25, 2012 at 10:12:07AM +0100, Mel Gorman wrote:
> On Mon, Sep 24, 2012 at 02:26:44PM -0700, Andrew Morton wrote:
> > On Mon, 24 Sep 2012 10:39:38 +0100
> > Mel Gorman <mgorman@xxxxxxx> wrote:
> > 
> > > On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> > > > 
> > > > Also, what has to be done to avoid the polling altogether? eg/ie, zap
> > > > a pageblock's PB_migrate_skip synchronously, when something was done to
> > > > that pageblock which justifies repolling it?
> > > > 
> > > 
> > > The "something" event you are looking for is pages being freed or
> > > allocated in the page allocator. A movable page being allocated in a block
> > > or a page being freed should clear the PB_migrate_skip bit if it's set.
> > > Unfortunately this would impact the fast path of the alloc and free paths
> > > of the page allocator. I felt that that was too high a price to pay.
> > 
> > We already do a similar thing in the page allocator: clearing of
> > ->all_unreclaimable and ->pages_scanned.
> 
> That is true but that is a simple write (shared cache line but still) to
> a struct zone. Worse, now that you point it out, that's pretty stupid. It
> should be checking if the value is non-zero before writing to it to avoid
> a cache line bounce.
> 
> Clearing PG_migrate_skip in this path to avoid the need to ever poll is
> not as cheap as it needs to be:
> 
> set_pageblock_skip
>   -> set_pageblock_flags_group
>      -> page_zone
>      -> page_to_pfn
>      -> get_pageblock_bitmap
>      -> pfn_to_bitidx
>      -> __set_bit
> 
> > But that isn't on the "fast
> > path" really - it happens once per pcp unload.
> 
> That's still an important enough path that I'm wary of making it fatter
> and that only covers the free path. To avoid the polling, the allocation
> side needs to be handled too. It could be shoved down into rmqueue() to
> put it into a slightly colder path but still, it's a price to pay to keep
> compaction happy.
> 
> > Can we do something
> > like that?  Drop some hint into the zone without having to visit each
> > page?
> 
> Not without incurring a cost, but yes, it is possible to give a hint on when
> PG_migrate_skip should be cleared and move away from that time-based hammer.
> 
> First, we'd introduce a variant of get_pageblock_migratetype() that returns
> all the bits for the pageblock flags and then helpers to extract either the
> migratetype or the PG_migrate_skip. We are already incurring the cost of
> get_pageblock_migratetype() so it will not be much more expensive than what
> is already there. If there is an allocation or free within a pageblock that
> has the PG_migrate_skip bit set then we increment a counter. When the counter
> reaches some to-be-decided "threshold" then compaction may clear all the
> bits. This would match the criteria of the clearing being based on activity.
> 
> There are four potential problems with this
> 
> 1. The logic to retrieve all the bits and split them up will be a little
>    convoluted but maybe it would not be that bad.
> 
> 2. The counter is a shared-writable cache line but obviously it could
>    be moved to vmstat and incremented with inc_zone_page_state to offset
>    the cost a little.
> 
> 3. The biggest weakness is that there is no way to know if the counter
>    is being incremented based on activity in a small subset of blocks.
> 
> 4. What should the threshold be?
> 
> The first problem is minor but the other three are potentially a mess.
> Adding another vmstat counter is bad enough in itself but if the counter
> is incremented based on a small subset of pageblocks, the hint is
> potentially useless.

Another idea is that we can add two bits (PG_check_migrate/PG_check_free)
in pageblock_flags_group. In the allocation path, we can set
PG_check_migrate in a pageblock; in the free path, we can set
PG_check_free in a pageblock. They would be cleared by compaction's scan,
like now. So we can discard problems 3 and 4 at least.
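Very roughly, and purely as a sketch - PB_check_migrate/PB_check_free
would be new entries in enum pageblock_bits and the wrappers below just
copy the style of the existing get/set_pageblock_skip() accessors; none
of these names exist today:

#define get_pageblock_check_migrate(page) \
	get_pageblock_flags_group(page, PB_check_migrate, PB_check_migrate)
#define set_pageblock_check_migrate(page) \
	set_pageblock_flags_group(page, 1, PB_check_migrate, PB_check_migrate)
#define clear_pageblock_check_migrate(page) \
	set_pageblock_flags_group(page, 0, PB_check_migrate, PB_check_migrate)

#define get_pageblock_check_free(page) \
	get_pageblock_flags_group(page, PB_check_free, PB_check_free)
#define set_pageblock_check_free(page) \
	set_pageblock_flags_group(page, 1, PB_check_free, PB_check_free)
#define clear_pageblock_check_free(page) \
	set_pageblock_flags_group(page, 0, PB_check_free, PB_check_free)

/* allocation side, somewhere cold-ish like rmqueue() */
static inline void note_pageblock_alloc(struct page *page)
{
	/* test before set to avoid a needless cache line bounce */
	if (!get_pageblock_check_migrate(page))
		set_pageblock_check_migrate(page);
}

/* free side, e.g. once per pcp unload */
static inline void note_pageblock_free(struct page *page)
{
	if (!get_pageblock_check_free(page))
		set_pageblock_check_free(page);
}

/*
 * compaction's scan then clears PG_migrate_skip only on pageblocks
 * that saw activity since the last scan, instead of on a timer
 */
static void reset_pageblock_skip_if_active(struct page *page)
{
	if (get_pageblock_check_migrate(page) ||
	    get_pageblock_check_free(page)) {
		clear_pageblock_skip(page);
		clear_pageblock_check_migrate(page);
		clear_pageblock_check_free(page);
	}
}

The clearing stays activity-based per pageblock, so there is no shared
counter to bounce and no threshold to tune.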
Another idea is to cure it by fixing the fundamental problem: make the
zone's locks more fine-grained. As time goes by, systems use bigger
memory but our zone locks aren't scalable. Recently, lru_lock and
zone->lock contention reports aren't rare, so I think it's a good time
to take the next step. How about defining a struct sub_zone per 2G or
4G? A zone could then have several sub_zones according to its size; the
sub_zone would take over the current zone's role and the zone would
become just a container of sub_zones. Of course, it's not easy to
implement, but I think someday we should go that way. Is it really
overkill?

> However, does this match what you have in mind or am I over-complicating
> things?
> 
> > > > > ...
> > > > >
> > > > > +static void reset_isolation_suitable(struct zone *zone)
> > > > > +{
> > > > > +	unsigned long start_pfn = zone->zone_start_pfn;
> > > > > +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > > > > +	unsigned long pfn;
> > > > > +
> > > > > +	/*
> > > > > +	 * Do not reset more than once every five seconds. If allocations are
> > > > > +	 * failing sufficiently quickly to allow this to happen then continually
> > > > > +	 * scanning for compaction is not going to help. The choice of five
> > > > > +	 * seconds is arbitrary but will mitigate excessive scanning.
> > > > > +	 */
> > > > > +	if (time_before(jiffies, zone->compact_blockskip_expire))
> > > > > +		return;
> > > > > +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
> > > > > +
> > > > > +	/* Walk the zone and mark every pageblock as suitable for isolation */
> > > > > +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> > > > > +		struct page *page;
> > > > > +		if (!pfn_valid(pfn))
> > > > > +			continue;
> > > > > +
> > > > > +		page = pfn_to_page(pfn);
> > > > > +		if (zone != page_zone(page))
> > > > > +			continue;
> > > > > +
> > > > > +		clear_pageblock_skip(page);
> > > > > +	}
> > > > 
> > > > What's the worst-case loop count here?
> > > > 
> > > 
> > > zone->spanned_pages >> pageblock_order
> > 
> > What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)
> 
> Let's take an unlikely case - a 128G single-node machine. That loop count
> on x86-64 would be 65536. It'll be fast enough, particularly in this
> path.
> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind regards,
Minchan Kim