On Wed, 28 Sep 2016 00:53:18 +1000 Nicholas Piggin <npiggin@xxxxxxxxx> wrote:

> On Tue, 27 Sep 2016 09:31:04 +0100
> Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> > Hi Linus,
> >
> > On Mon, Sep 26, 2016 at 01:58:00PM -0700, Linus Torvalds wrote:
> > > So I've been doing some profiling of the git "make test" load, which
> > > is interesting because it's just a lot of small scripts, and it shows
> > > our fork/execve/exit costs.
> > >
> > > Some of the top offenders are pretty understandable: we have long had
> > > unmap_page_range() show up at the top on these loads, because the
> > > zap_pte_range() function ends up touching each page as it unmaps
> > > things (it does it to check whether it's an anonymous page, but then
> > > also for the page map count update etc), and that's a disaster from a
> > > cache standpoint. That single function accounts for something between
> > > 3-4% of CPU time, and the one instruction that accesses "struct page"
> > > the first time is a large portion of that. Yes, a single instruction
> > > is blamed for about 1% of all CPU time on a fork/exec/exit workload.
> > >
> >
> > It was found at one point that fault-around made these costs worse, as
> > there were simply more pages to tear down. However, this only applied
> > to fork/exit microbenchmarks. Matt Fleming prototyped an unreleased
> > patch that tried to be clever about this, but the cost was never
> > worthwhile. A plain revert helped a microbenchmark but hurt workloads
> > like the git testsuite, which is shell intensive.
> >
> > It got filed under "we're not fixing a fork/exit microbenchmark at the
> > expense of 'real' workloads like git checkout and git testsuite".
> >
> > > <SNIP>
> > >
> > > #5 and #6 on my profile are user space (_int_malloc in glibc, and
> > > do_lookup_x in the loader). I think user space should probably start
> > > thinking more about using static libraries for the really core basic
> > > things, but whatever. Not a kernel issue.
> > >
> >
> > Recent problems have been fixed with _int_malloc in glibc, particularly
> > as it applies to threads, but no fix springs to mind that might impact
> > "make test".
> >
> > > #7 is in the kernel again. And that one struck me as really odd. It's
> > > "unlock_page()", while #9 is __wake_up_bit(). WTF? There's no IO in
> > > this load, it's all cached, so why do we use 3% of the time (1.7% and
> > > 1.4% respectively) on unlocking a page? And why can't I see the
> > > locking part?
> > >
> > > It turns out that I *can* see the locking part, but it's pretty cheap.
> > > It's inside of filemap_map_pages(), which does a trylock, and it shows
> > > up as about 1/6th of the cost of that function. Still, it's much less
> > > than the unlocking side. Why is unlocking so expensive?
> > >
> > > Yeah, unlocking is expensive because of the nasty __wake_up_bit()
> > > code. In fact, even inside "unlock_page()" itself, most of the cost
> > > isn't the atomic bit clearing (like you'd expect), it's the inlined
> > > part of wake_up_bit(), which does some really nasty crud.
> > >
> > > Why is the page_waitqueue() handling so expensive? Let me count the
> > > ways:
> > >
> >
> > page_waitqueue() has been a hazard for years. I think the last attempt
> > to fix it was back in 2014:
> > http://www.spinics.net/lists/linux-mm/msg73207.html
> >
> > The patch is heavily derived from work by Nick Piggin, who noticed the
> > problem years before that. I think that was the last version I posted,
> > and the changelog includes profile data. I don't have an exact
> > reference for why it was rejected, but a consistent piece of feedback
> > was that it was very complex for the level of impact it had.
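
To spell out the cost being discussed, this is roughly what mainline does
on every unlock today (a simplified sketch, debug checks elided):

	wait_queue_head_t *page_waitqueue(struct page *page)
	{
		const struct zone *zone = page_zone(page);

		/* Hash lookup: touches a (usually cold) wait table entry */
		return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
	}

	void unlock_page(struct page *page)
	{
		page = compound_head(page);
		clear_bit_unlock(PG_locked, &page->flags);
		smp_mb__after_atomic();
		/*
		 * __wake_up_bit() then does a waitqueue_active() test on
		 * that cold cacheline; without a waiters bit in
		 * page->flags there is no cheap way to skip any of this.
		 */
		__wake_up_bit(page_waitqueue(page), &page->flags, PG_locked);
	}

So even the no-waiter common case pays for the hash and the wait table
cacheline.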
> Huh, I was just wondering about this again the other day. Powerpc has
> some interesting issues with atomic ops and barriers (not to mention
> random cache misses that hurt everybody).
>
> It actually wasn't for big Altix machines (at least not when I wrote
> it), but it came from some effort to optimize page reclaim performance
> on an Opteron with (back then) a lot of cores per node.
>
> And it's not only for scalability, it's a single-threaded performance
> optimisation as much as anything.
>
> By the way, I think the patch linked above takes the wrong approach.
> Better to put all the complexity of maintaining the waiters bit into
> the sleep/wake functions. The fastpath simply tests the bit in no less
> racy a manner than the unlocked waitqueue_active() test. Really
> incomplete patch attached for reference.
>
> The part where we hack the wait code into maintaining the extra bit
> for us is pretty mechanical and boring, so long as it's done under the
> waitqueue lock.
>
> The more interesting part is the ability to avoid the barrier between
> the fastpath clearing a bit and testing for waiters.
>
>   unlock():              lock() (slowpath):
>   clear_bit(PG_locked)   set_bit(PG_waiter)
>   test_bit(PG_waiter)    test_bit(PG_locked)
>
> If these were memory ops to different words, they would require an
> smp_mb on each side. Being to the same word, can we avoid them? ISTR
> Linus you were worried about stores being forwarded to loads before
> they are visible to the other CPU. I think that should be okay because
> the stores will be ordered, and the load can't move earlier than the
> store on the same CPU. Maybe I completely misremember it.
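
To make the ordering concrete, here is a minimal model of the two paths
(a sketch only, not the patch itself; sleep_on_page_bit() is a made-up
stand-in for the add-to-waitqueue-and-schedule step):

	/* unlock fastpath: release the bit, then check for waiters */
	void unlock_fastpath(struct page *page)
	{
		clear_bit_unlock(PG_locked, &page->flags);
		/*
		 * Same-word test: a waiter that set PG_waiter before
		 * sleeping must be visible here, because both operations
		 * hit the same word of page->flags.
		 */
		if (test_bit(PG_waiter, &page->flags))
			wake_up_page_bit(page, PG_locked);
	}

	/* lock slowpath: announce the waiter, then (re)try the lock */
	void lock_slowpath(struct page *page)
	{
		for (;;) {
			set_bit(PG_waiter, &page->flags);
			if (!test_and_set_bit_lock(PG_locked, &page->flags))
				return;		/* got the lock */
			sleep_on_page_bit(page, PG_locked);
		}
	}

Either the unlocker sees PG_waiter and issues the wakeup, or the waiter
sees PG_locked already clear and takes the lock itself; the lost-wakeup
window only opens if the slowpath tests before it sets.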
> Subject: [PATCH] blah
>
> ---
>  include/linux/pagemap.h | 10 +++---
>  mm/filemap.c            | 92 +++++++++++++++++++++++++++++++++++++++++++------
>  2 files changed, 87 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 66a1260..a536df9 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -479,10 +479,11 @@ static inline int wait_on_page_locked_killable(struct page *page)
>  	return wait_on_page_bit_killable(compound_head(page), PG_locked);
>  }
>  
> -extern wait_queue_head_t *page_waitqueue(struct page *page);
> +extern void wake_up_page_bit(struct page *page, int bit_nr);
>  static inline void wake_up_page(struct page *page, int bit)
>  {
> -	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
> +	if (!PageWaiters(page))
> +		return;
> +	wake_up_page_bit(page, bit);
>  }
>  
>  /*
> @@ -494,8 +495,9 @@ static inline void wake_up_page(struct page *page, int bit)
>   */
>  static inline void wait_on_page_locked(struct page *page)
>  {
> -	if (PageLocked(page))
> -		wait_on_page_bit(compound_head(page), PG_locked);
> +	if (!PageLocked(page))
> +		return;
> +	wait_on_page_bit(compound_head(page), PG_locked);
>  }
>  
>  /*
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8a287df..09bca8a 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -775,24 +775,98 @@ EXPORT_SYMBOL(__page_cache_alloc);
>   * at a cost of "thundering herd" phenomena during rare hash
>   * collisions.
>   */
> -wait_queue_head_t *page_waitqueue(struct page *page)
> +static wait_queue_head_t *page_waitqueue(struct page *page)
>  {
>  	const struct zone *zone = page_zone(page);
>  
>  	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
>  }
> -EXPORT_SYMBOL(page_waitqueue);
> +
> +struct wait_page_key {
> +	struct page *page;
> +	int bit_nr;
> +	int page_match;
> +};
> +
> +struct wait_page_queue {
> +	struct wait_page_key key;
> +	wait_queue_t wait;
> +};
> +
> +static int wake_page_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
> +{
> +	struct wait_page_queue *wait_page
> +		= container_of(wait, struct wait_page_queue, wait);
> +	struct wait_page_key *key = arg;
> +	int ret;
> +
> +	if (wait_page->key.page != key->page)
> +		return 0;
> +	key->page_match = 1;
> +	if (wait_page->key.bit_nr != key->bit_nr ||
> +	    test_bit(key->bit_nr, &key->page->flags))
> +		return 0;
> +
> +	ret = try_to_wake_up(wait->private, mode, sync);
> +	if (ret)
> +		list_del_init(&wait->task_list);
> +	return ret;
> +}
>  
>  void wait_on_page_bit(struct page *page, int bit_nr)
>  {
> -	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
> +	wait_queue_head_t *wq = page_waitqueue(page);
> +	struct wait_page_queue wait;
>  
> -	if (test_bit(bit_nr, &page->flags))
> -		__wait_on_bit(page_waitqueue(page), &wait, bit_wait_io,
> -							TASK_UNINTERRUPTIBLE);
> +	init_wait(&wait.wait);
> +	wait.wait.func = wake_page_function;
> +	wait.key.page = page;
> +	wait.key.bit_nr = bit_nr;
> +	/* wait.key.page_match unused */
> +
> +	wait.wait.flags &= ~WQ_FLAG_EXCLUSIVE;
> +
> +again:
> +	spin_lock_irq(&wq->lock);
> +	if (unlikely(!test_bit(bit_nr, &page->flags))) {
> +		spin_unlock_irq(&wq->lock);
> +		return;
> +	}
> +
> +	if (list_empty(&wait.wait.task_list)) {
> +		__add_wait_queue(wq, &wait.wait);
> +		if (!PageWaiters(page))
> +			SetPageWaiters(page);
> +	}

Ugh, I even wrote the correct ordering in the email, but did it the wrong
way here. Anyway, you get the idea.
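
For the record, the slowpath with the ordering corrected would look
something like this (still a sketch; clearing PG_waiters again when the
queue drains is left out, as in the patch):

	again:
		spin_lock_irq(&wq->lock);
		/* Announce the waiter first... */
		if (list_empty(&wait.wait.task_list)) {
			__add_wait_queue(wq, &wait.wait);
			if (!PageWaiters(page))
				SetPageWaiters(page);
		}
		/*
		 * ...and only then re-test the page bit. Done the other
		 * way around, the unlocker can clear the bit and test
		 * PG_waiter in the window between our test and our set,
		 * and the wakeup is lost.
		 */
		if (unlikely(!test_bit(bit_nr, &page->flags))) {
			list_del_init(&wait.wait.task_list);
			spin_unlock_irq(&wq->lock);
			return;
		}
		set_current_state(TASK_UNINTERRUPTIBLE);
		spin_unlock_irq(&wq->lock);
		schedule();
		goto again;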