Also, this new patch is working fine and fixes the problem.

Mel, I cannot run your script:

# perl watch-highorder-latency.pl
Failed to open /sys/kernel/debug/tracing/set_ftrace_filter for writing at watch-highorder-latency.pl line 17.

# ls -ld /sys/kernel/debug/
ls: cannot access /sys/kernel/debug/: No such file or directory

azur

______________________________________________________________
> From: "Mel Gorman" <mel@xxxxxxxxx>
> To: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Date: 14.04.2011 12:25
> Subject: Re: Regression from 2.6.36
>
> CC: "Eric Dumazet" <eric.dumazet@xxxxxxxxx>, "Changli Gao" <xiaosuo@xxxxxxxxx>, "Américo Wang" <xiyou.wangcong@xxxxxxxxx>, "Jiri Slaby" <jslaby@xxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, linux-mm@xxxxxxxxx, linux-fsdevel@xxxxxxxxxxxxxxx, "Jiri Slaby" <jirislaby@xxxxxxxxx>
>On Wed, Apr 13, 2011 at 02:16:00PM -0700, Andrew Morton wrote:
>> On Wed, 13 Apr 2011 04:37:36 +0200
>> Eric Dumazet <eric.dumazet@xxxxxxxxx> wrote:
>>
>> > On Tuesday, 12 April 2011 at 18:31 -0700, Andrew Morton wrote:
>> > > On Wed, 13 Apr 2011 09:23:11 +0800 Changli Gao <xiaosuo@xxxxxxxxx> wrote:
>> > >
>> > > > On Wed, Apr 13, 2011 at 6:49 AM, Andrew Morton
>> > > > <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>> > > > >
>> > > > > It's somewhat unclear (to me) what caused this regression.
>> > > > >
>> > > > > Is it because the kernel is now doing large kmalloc()s for the fdtable,
>> > > > > and this makes the page allocator go nuts trying to satisfy high-order
>> > > > > page allocation requests?
>> > > > >
>> > > > > Is it because the kernel now will usually free the fdtable
>> > > > > synchronously within the rcu callback, rather than deferring this to a
>> > > > > workqueue?
>> > > > >
>> > > > > The latter seems unlikely, so I'm thinking this was a case of
>> > > > > high-order-allocations-considered-harmful?
>> > > > >
>> > > >
>> > > > Maybe, but I am not sure. Maybe my patch causes too much internal
>> > > > fragmentation. For example, when asking for 5 pages, we get 8 pages and
>> > > > 3 pages are wasted, and eventually memory thrashing sets in.
>> > >
>> > > That theory sounds less likely, but could be tested by using
>> > > alloc_pages_exact().
>> > >
>> >
>> > Very unlikely, since fdtable sizes are powers of two, unless you hit
>> > sysctl_nr_open and it was changed (the default value being 2^20).
>> >
>>
>> So am I correct in believing that this regression is due to the
>> high-order allocations putting excess stress onto page reclaim?
>>
>
>This is very plausible, but it would be nice to get confirmation of
>what the size of the fdtable was to be sure. If it's big enough for
>high-order allocations and it's a fork-heavy workload with memory
>mostly in use, the fork() latencies could be getting very high. In
>addition, each fork is potentially kicking kswapd awake (to rebalance
>the zone for higher orders). I do not see CONFIG_COMPACTION enabled,
>meaning that if I'm right that kswapd is awake and fork() is
>entering direct reclaim, then we are lumpy reclaiming as well, which
>can stall pretty severely.
>
>> If so, then how large _are_ these allocations? This perhaps can be
>> determined from /proc/slabinfo. They must be pretty huge, because slub
>> likes to do excessively-large allocations and the system handles that
>> reasonably well.
>>
>
>I'd be interested in finding out what the value of /proc/sys/fs/file-max
>and the output of ulimit -n (max open files) for the main server are.
>This should help us determine the size of the fdtable.
>
>> I suppose that a suitable fix would be
>>
>>
>> From: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>>
>> Azurit reports large increases in system time after 2.6.36 when running
>> Apache. It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
>> to allocate fdmem if possible").
>>
>> That patch caused the vfs to use kmalloc() for very large allocations and
>> this is causing excessive work (and presumably excessive reclaim) within
>> the page allocator.
>>
>> Fix it by falling back to vmalloc() earlier - when the allocation attempt
>> would have been considered "costly" by reclaim.
>>
>> Reported-by: azurIt <azurit@xxxxxxxx>
>> Cc: Changli Gao <xiaosuo@xxxxxxxxx>
>> Cc: Americo Wang <xiyou.wangcong@xxxxxxxxx>
>> Cc: Jiri Slaby <jslaby@xxxxxxx>
>> Cc: Eric Dumazet <eric.dumazet@xxxxxxxxx>
>> Cc: Mel Gorman <mel@xxxxxxxxx>
>> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> ---
>>
>>  fs/file.c |   17 ++++++++++-------
>>  1 file changed, 10 insertions(+), 7 deletions(-)
>>
>> diff -puN fs/file.c~a fs/file.c
>> --- a/fs/file.c~a
>> +++ a/fs/file.c
>> @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
>>   */
>>  static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>>
>> -static inline void *alloc_fdmem(unsigned int size)
>> +static void *alloc_fdmem(unsigned int size)
>>  {
>> -	void *data;
>> -
>> -	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> -	if (data != NULL)
>> -		return data;
>> -
>> +	/*
>> +	 * Very large allocations can stress page reclaim, so fall back to
>> +	 * vmalloc() if the allocation size will be considered "large" by the VM.
>> +	 */
>> +	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>
>The reporter will need to retest that this is really ok. The patch that was
>reported to help avoided high-order allocations entirely. If fork-heavy
>workloads are really entering direct reclaim and increasing fork latency
>enough to ruin performance, then this patch will also suffer. How much
>it helps depends on how big the fdtable is.
>
>> +		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> +		if (data != NULL)
>> +			return data;
>> +	}
>>  	return vmalloc(size);
>>  }
>>
>
>I'm attaching a primitive perl script that reports high-order allocation
>latencies. It would be interesting to see what its output looks like,
>particularly when the server is in trouble, if the bug reporter has the
>time.
>
>--
>Mel Gorman
>SUSE Labs
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
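
For reference, below is a rough, untested userspace sketch of the size arithmetic being discussed above: how big the struct file * array in the fdtable gets for a given number of open files, and whether that size would stay under the kmalloc threshold (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) used by the proposed patch. The constants are assumptions (4K pages, PAGE_ALLOC_COSTLY_ORDER == 3, 8-byte pointers), and the open/close-on-exec bitmaps carried by the real fdtable are ignored.

/*
 * Rough userspace sketch (illustration only): fd array size vs. the
 * kmalloc/vmalloc threshold in the proposed alloc_fdmem() change.
 * The ASSUMED_* values are assumptions, not read from a running kernel.
 */
#include <stdio.h>
#include <stdlib.h>

#define ASSUMED_PAGE_SIZE	4096UL
#define ASSUMED_COSTLY_ORDER	3	/* PAGE_ALLOC_COSTLY_ORDER */

int main(int argc, char **argv)
{
	/* number of descriptors, e.g. from ulimit -n; default 2^20 */
	unsigned long nr_fds = argc > 1 ? strtoul(argv[1], NULL, 0) : 1UL << 20;
	unsigned long size = nr_fds * sizeof(void *);	/* struct file * array */
	unsigned long costly = ASSUMED_PAGE_SIZE << ASSUMED_COSTLY_ORDER;
	int order = 0;

	/* smallest page order whose block covers the array */
	while ((ASSUMED_PAGE_SIZE << order) < size)
		order++;

	printf("%lu fds -> %lu byte fd array (order-%d allocation)\n",
	       nr_fds, size, order);
	printf("proposed patch would use %s (kmalloc threshold: %lu bytes)\n",
	       size <= costly ? "kmalloc" : "vmalloc", costly);
	return 0;
}

Under these assumptions the threshold is 32KB, so the fd array stays on the kmalloc path up to about 4096 descriptors; with the sysctl_nr_open default of 2^20 mentioned above, the array alone is 8MB and would always take the vmalloc() fallback.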