On Wed, Oct 26, 2016 at 02:26:57PM -0700, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 1:31 PM, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > IO wait activity is not all that matters. We hit the lock/unlock paths
> > during a lot of operations like reclaim.
> 
> I doubt we do.
> 
> Yes, we hit the lock/unlock itself, but do we hit the *contention*?
> 
> The current code is nasty, and always ends up touching the wait-queue
> regardless of whether it needs to or not, but we have a fix for that.
> 

To be clear, are you referring to PeterZ's patch that avoids the lookup?
If so, I see your point.

> With that fixed, do we actually get contention on a per-page basis?

Reclaim would have to be running in parallel with migrations, faults,
clearing write-protect etc. I can't think of a situation where a normal
workload would hit it regularly and/or for long durations.

> Because without contention, we'd never actually look up the wait-queue
> at all.
> 
> I suspect that without IO, it's really really hard to actually get
> that contention, because things like reclaim end up looking at the LRU
> queue etc with their own locking, so it should look at various
> individual pages one at a time, not have multiple queues look at the
> same page.
> 

Except for many direct reclaimers hitting small LRUs while a system is
thrashing -- but that's not a case that really matters, as you've
already lost by then.

> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 7f2ae99e5daf..0f088f3a2fed 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -440,33 +440,7 @@ struct zone {
> >> +	int initialized;
> >> 
> >> 	/* Write-intensive fields used from the page allocator */
> >> 	ZONE_PADDING(_pad1_)
> > 
> > zone_is_initialized is mostly the domain of hotplug. A potential cleanup
> > is to use a page flag and shrink the size of zone slightly. Nothing to
> > panic over.
> 
> I really did that to make it very obvious that there was no semantic
> change.
> I just set the "initialized" flag in the same place where it
> used to initialize the wait_table, so that this:
> 
> >> static inline bool zone_is_initialized(struct zone *zone)
> >> {
> >> -	return !!zone->wait_table;
> >> +	return zone->initialized;
> >> }
> 
> ends up being obviously equivalent.

No problem with that.

> >> +#define WAIT_TABLE_BITS 8
> >> +#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
> >> +static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
> >> +
> >> +wait_queue_head_t *bit_waitqueue(void *word, int bit)
> >> +{
> >> +	const int shift = BITS_PER_LONG == 32 ? 5 : 6;
> >> +	unsigned long val = (unsigned long)word << shift | bit;
> >> +
> >> +	return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
> >> +}
> >> +EXPORT_SYMBOL(bit_waitqueue);
> >> +
> > 
> > Minor nit that it's unfortunate this moved to the scheduler core. It
> > wouldn't have been a complete disaster to add a page_waitqueue_init() or
> > something similar after sched_init.
> 
> I considered that, but decided that "minimal patch" was better. Plus,
> with that bit_waitqueue() actually also being used for the page
> locking queues (which act _kind of_, but not quite, like a bitlock),
> the bit_wait_table is actually more core than just the bit-wait code.
> 
> In fact, I considered just renaming it to "hashed_wait_queue", because
> that's effectively how we use it now, rather than being particularly
> specific to the bit-waiting. But again, that would have made the patch
> bigger, which I wanted to avoid since this is a post-rc2 thing due to
> the gfs2 breakage.

No objection. Shuffling it around does not make it obviously better in
any way.

In the meantime, a machine freed up. FWIW, the patch survived booting on
a 2-socket machine and about 20 minutes of bashing on the reclaim paths
from multiple processes to beat on lock/unlock. I didn't do a
performance comparison or gather profile data, but I wouldn't expect
anything interesting from the profiles other than some cycles saved.
-- 
Mel Gorman
SUSE Labs