On Thu, Nov 25, 2010 at 04:33:01PM +0100, Peter Sch??ller wrote: > > simple thing to do in any case. ??You can watch the entries in slabinfo > > and see if any of the ones with sizes over 4096 bytes are getting used > > often. ??You can also watch /proc/buddyinfo and see how often columns > > other than the first couple are moving around. > > I collected some information from > /proc/{buddyinfo,meminfo,slabinfo,vmstat} and let it sit, polling > approximately once per minute. I have some results correlated with > another page eviction in graphs. The graph is here: > > http://files.spotify.com/memcut/memgraph-20101124.png > > The last sudden eviction there occurred somewhere between 22:30 and > 22:45. Some URL:s that can be compared for those periods: > > Before: > http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/vmstat > http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/buddyinfo > http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/meminfo > http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/slabinfo > > After: > http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/vmstat > http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/buddyinfo > http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/meminfo > http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/slabinfo Disclaimer: I have no idea what I'm doing. :) Your buddyinfo looks to be pretty low for order 3 and above, before and after the sudden eviction, so my guess is that it's probably related to the issues I'm seeing with fragmentation, but maybe not fighting between zones, since you seem to have a larger Normal zone than DMA32. (Not sure, you didn't post /proc/zoneinfo). Also, you seem to be on an actual NUMA system, so other things are happening there, too. If you have munin installed (it looks like you do), try enabling the buddyinfo plugin available since munin 1.4.4. It graphs the buddyinfo data, so it could be lined up with the memory graphs (thanks Erik). [snip] > kmalloc increases: > > -kmalloc-4096 301 328 4096 8 8 : tunables 0 0 > 0 : slabdata 41 41 0 > +kmalloc-4096 637 680 4096 8 8 : tunables 0 0 > 0 : slabdata 85 85 0 > -kmalloc-2048 18215 19696 2048 16 8 : tunables 0 0 > 0 : slabdata 1231 1231 0 > +kmalloc-2048 41908 51792 2048 16 8 : tunables 0 0 > 0 : slabdata 3237 3237 0 > -kmalloc-1024 85444 97280 1024 32 8 : tunables 0 0 > 0 : slabdata 3040 3040 0 > +kmalloc-1024 267031 327104 1024 32 8 : tunables 0 0 > 0 : slabdata 10222 10222 0 Note that all of the above are actually attempting order-3 allocations first; see /sys/kernel/slab/kmalloc-1024/order, for instance. The "8" is means "8 pages per slab", which means order 3 is the attempted allocation size. I did the following on a system to test, but the free memory did not actually improve. It seems that even only order 1 allocations are enough to reclaim too much order 0. Even a "while true; sleep .01; done" caused free memory to start increasing due to order 1 (task_struct allocation) watermarks waking kswapd, while our other usual VM activity is happening. #!/bin/bash for i in /sys/kernel/slab/*/; do if [ `cat $i/object_size` -le 4096 ]; then echo 0 > $i/order else echo 1 > $i/order fi done But this is on another machine, without Mel's patch, and with 8 GB memory, so a bigger Normal zone. [snip] > If my interpretation and understanding is correct, this indicates that > for example, ~3000 to ~10000 3-order allocations resulting from 1 kb > kmalloc():s. Meaning about 0.2 gig ( 7000*4*8*1024/1024/1024). Add the > other ones and we get some more, but only a few hundred megs in total. > > Going by the hypothesis that we are seeing the same thing as reported > by Simon Kirby (I'll respond to that E-Mail separately), the total > amount is (as far as I understand) not the important part, but the > fact that we saw a non-trivial increase in 3-order allocations would > perhaps be a consistent observation in that frequent 3-order > allocations might be more likely to trigger the behavior Simon > reports. Try installing the "perf" tool. It can be built from the kernel tree in tools/perf, and then you usually can just copy the binary around. You can use it to trace the points which cause kswapd to wake up, which will show which processes are doing it, the order, flags, etc. Just before the eviction is about to happen (or whenever), try this: perf record --event vmscan:mm_vmscan_wakeup_kswapd --filter 'order>=3' \ --call-graph -a sleep 30 Then view the recorded events with "perf trace", which should spit out something like this: lmtp-3531 [003] 432339.243851: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3 lmtp-3531 [003] 432339.243856: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3 The process which woke kswapd may not be directly responsible for the allocation as a network interrrupt or something could have happened on top of it. See "perf report", which is a bit dodgy at least for me, to see the stack traces, which might make things clearer. For example, my traces show that even kswapd wakes kswapd sometimes, but it's because of a trace like this: - 9.09% kswapd0 [kernel.kallsyms] [k] wakeup_kswapd wakeup_kswapd __alloc_pages_nodemask alloc_pages_current new_slab __slab_alloc __kmalloc_node_track_caller __alloc_skb __netdev_alloc_skb bnx2_poll_work bnx2_poll net_rx_action __do_softirq call_softirq do_softirq irq_exit do_IRQ ret_from_intr truncate_inode_pages proc_evict_inode evict iput dentry_iput d_kill __shrink_dcache_sb shrink_dcache_memory shrink_slab kswapd Anyway, maybe you'll see some interesting traces. If kswapd isn't waking very often, you can also trace "kmem:mm_page_alloc" or similar (see "perf list"), or try a smaller order or a longer sleep. Cheers, Simon- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>