On Mon, Oct 11, 2010 at 04:56:48PM +0800, Mel Gorman wrote:
> On Sat, Oct 09, 2010 at 08:58:07AM +0800, Shaohua Li wrote:
> > On Fri, Oct 08, 2010 at 11:29:53PM +0800, Mel Gorman wrote:
> > > On Tue, Sep 28, 2010 at 01:08:01PM +0800, Shaohua Li wrote:
> > > > In a 4 socket, 64 CPU system, zone_nr_free_pages() takes about 5%~10% of
> > > > CPU time according to perf when memory pressure is high. The workload does
> > > > something like:
> > > >
> > > >   for i in `seq 1 $nr_cpu`
> > > >   do
> > > >           create_sparse_file $SPARSE_FILE-$i $((10 * mem / nr_cpu))
> > > >           $USEMEM -f $SPARSE_FILE-$i -j 4096 --readonly $((10 * mem / nr_cpu)) &
> > > >   done
> > > >
> > > > This simply reads a sparse file on each CPU. Apparently
> > > > zone->percpu_drift_mark is too big, and I guess zone_page_state_snapshot()
> > > > causes a lot of cache bouncing on ->vm_stat_diff[]. Below is the zoneinfo
> > > > for reference.
> > >
> > > Would it be possible for you to post the oprofile report? I'm in the
> > > early stages of trying to reproduce this locally based on your test
> > > description. The first machine I tried showed that zone_nr_free_pages
> > > was consuming 0.26% of profile time, with the vast bulk occupied by
> > > do_mpage_readpage. See as follows:
> > >
> > > 1599339  53.3463  vmlinux-2.6.36-rc7-pcpudrift  do_mpage_readpage
> > > 131713    4.3933  vmlinux-2.6.36-rc7-pcpudrift  __isolate_lru_page
> > > 103958    3.4675  vmlinux-2.6.36-rc7-pcpudrift  free_pcppages_bulk
> > > 85024     2.8360  vmlinux-2.6.36-rc7-pcpudrift  __rmqueue
> > > 78697     2.6250  vmlinux-2.6.36-rc7-pcpudrift  native_flush_tlb_others
> > > 75678     2.5243  vmlinux-2.6.36-rc7-pcpudrift  unlock_page
> > > 68741     2.2929  vmlinux-2.6.36-rc7-pcpudrift  get_page_from_freelist
> > > 56043     1.8693  vmlinux-2.6.36-rc7-pcpudrift  __alloc_pages_nodemask
> > > 55863     1.8633  vmlinux-2.6.36-rc7-pcpudrift  ____pagevec_lru_add
> > > 46044     1.5358  vmlinux-2.6.36-rc7-pcpudrift  radix_tree_delete
> > > 44543     1.4857  vmlinux-2.6.36-rc7-pcpudrift  shrink_page_list
> > > 33636     1.1219  vmlinux-2.6.36-rc7-pcpudrift  zone_watermark_ok
> > > .....
> > > 7855      0.2620  vmlinux-2.6.36-rc7-pcpudrift  zone_nr_free_pages
> > >
> > > The machine I am testing on is a non-NUMA, single socket, 4-core box with
> > > totally different characteristics, but I want to be sure I'm going more or
> > > less in the right direction with the reproduction case before trying to
> > > find a larger machine.
> >
> > Here it is. This is a 4 socket Nehalem machine.
> >
> > 268160.00  57.2%  _raw_spin_lock            /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > 40302.00    8.6%  zone_nr_free_pages        /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > 36827.00    7.9%  do_mpage_readpage         /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > 28011.00    6.0%  _raw_spin_lock_irq        /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > 22973.00    4.9%  flush_tlb_others_ipi      /lib/modules/2.6.36-rc5-shli+/build/vmlinux
> > 10713.00    2.3%  smp_invalidate_interrupt  /lib/modules/2.6.36-rc5-shli+/build/vmlinux
>
> Ok, we are seeing *very* different things. Can you tell me more about
> what usemem actually does? I thought it might be doing something like
> mapping the file and just reading it, but that doesn't appear to be the
> case. I also tried using madvise to drop pages to strictly limit how
> much memory was used, but the profiles are still different.
>
> I've posted the very basic test script I was using based on your
> description. Can you tell me what usemem does differently or, better again,
> post the source of usemem? Can you also post your .config please.
> I'm curious to see why you are seeing so much more locking overhead.
> If you have lock debugging and lock stat enabled, would it be possible
> to test without them enabled to see what the profile looks like?

Basically the same test: I'm using Fengguang's test, please check the
attached file. I didn't enable lock stat or lock debugging. The difference
is that my test runs on a 4 socket system; on a 1 socket system, I don't
see the issue either. With 64 CPUs, the walk over every CPU's
->vm_stat_diff[] is far more expensive (see the sketch below).

Thanks,
Shaohua
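P.S. For anyone following along, here is a simplified sketch of the code
path in question, reconstructed from memory of the 2.6.36-era source (the
real code lives in mm/mmzone.c and include/linux/vmstat.h, so check there
for the exact details):

    /* include/linux/vmstat.h (roughly): a drift-free estimate of a zone
     * counter, paid for by walking every online CPU's pending diff */
    static inline unsigned long zone_page_state_snapshot(struct zone *zone,
                                            enum zone_stat_item item)
    {
            long x = atomic_long_read(&zone->vm_stat[item]);
            int cpu;

            /*
             * O(num_online_cpus()): each iteration reads another CPU's
             * per-cpu pageset, so concurrent callers on a 64 CPU box keep
             * bouncing ->vm_stat_diff[] cache lines between sockets.
             */
            for_each_online_cpu(cpu)
                    x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

            if (x < 0)
                    x = 0;
            return x;
    }

    /* mm/mmzone.c (roughly): take the expensive snapshot only when the
     * zone looks to be under pressure, i.e. free pages are below the
     * drift mark and kswapd is awake */
    unsigned long zone_nr_free_pages(struct zone *zone)
    {
            unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);

            if (nr_free_pages < zone->percpu_drift_mark &&
                            !waitqueue_active(&zone->zone_pgdat->kswapd_wait))
                    return zone_page_state_snapshot(zone, NR_FREE_PAGES);

            return nr_free_pages;
    }

If I read refresh_zone_stat_thresholds() right, percpu_drift_mark scales
with num_online_cpus() times the per-cpu stat threshold, so on a 64 CPU
machine the mark is large and NR_FREE_PAGES stays below it for long
stretches of this workload, which makes nearly every watermark check take
the snapshot path above.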
Attachment: test.tgz (GNU Unix tar archive)