On Sat, Sep 04, 2010 at 10:25:45AM +0800, Dave Chinner wrote:
> On Fri, Sep 03, 2010 at 04:00:26PM -0700, Andrew Morton wrote:
> > On Fri, 3 Sep 2010 10:08:46 +0100
> > Mel Gorman <mel@xxxxxxxxx> wrote:
> >
> > > When under significant memory pressure, a process enters direct reclaim
> > > and immediately afterwards tries to allocate a page. If it fails and no
> > > further progress is made, it's possible the system will go OOM. However,
> > > on systems with large amounts of memory, it's possible that a significant
> > > number of pages are on per-cpu lists and inaccessible to the calling
> > > process. This leads to a process entering direct reclaim more often than
> > > it should increasing the pressure on the system and compounding the problem.
> > >
> > > This patch notes that if direct reclaim is making progress but
> > > allocations are still failing that the system is already under heavy
> > > pressure. In this case, it drains the per-cpu lists and tries the
> > > allocation a second time before continuing.
> ....
> > The patch looks reasonable.
> >
> > But please take a look at the recent thread "mm: minute-long livelocks
> > in memory reclaim". There, people are pointing fingers at that
> > drain_all_pages() call, suspecting that it's causing huge IPI storms.
> >
> > Dave was going to test this theory but afaik hasn't yet done so. It
> > would be nice to tie these threads together if poss?
>
> It's been my "next-thing-to-do" since David suggested I try it -
> tracking down other problems has got in the way, though. I
> just ran my test a couple of times through:
>
> $ ./fs_mark -D 10000 -L 63 -S0 -n 100000 -s 0 \
>	-d /mnt/scratch/0 -d /mnt/scratch/1 \
>	-d /mnt/scratch/3 -d /mnt/scratch/2 \
>	-d /mnt/scratch/4 -d /mnt/scratch/5 \
>	-d /mnt/scratch/6 -d /mnt/scratch/7
>
> To create millions of inodes in parallel on an 8p/4G RAM VM.
> The filesystem is ~1.1TB XFS:
>
> # mkfs.xfs -f -d agcount=16 /dev/vdb
> meta-data=/dev/vdb               isize=256    agcount=16, agsize=16777216 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=268435456, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal log           bsize=4096   blocks=131072, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> # mount -o inode64,delaylog,logbsize=262144,nobarrier /dev/vdb /mnt/scratch
>
> Performance prior to this patch was that each iteration resulted in
> ~65k files/s, with occasional peaks to 90k files/s, but frequent
> drops to 45k files/s when reclaim ran to reclaim the inode
> caches. This load ran permanently at 800% CPU usage.
>
> Every so often (maybe once or twice in a 50M inode create run) all 8 CPUs
> would remain pegged but the create rate would drop to zero for a few
> seconds to a couple of minutes. That was the livelock issue I
> reported.
>
> With this patchset, I'm seeing a per-iteration average of ~77k
> files/s, with only a couple of iterations dropping down to ~55k
> files/s and a significant number above 90k/s. The runtime to 50M
> inodes is down by ~30% and the average CPU usage across the run is
> around 700%. IOWs, there is a significant gain in performance and
> a significant drop in CPU usage. I've done two runs to 50M inodes,
> and not seen any sign of a livelock, even for short periods of time.
>
> Ah, spoke too soon - I let the second run keep going, and at ~68M
> inodes it's just pegged all the CPUs and is pretty much completely
> wedged. Serial console is not responding, I can't get a new login,
> and the only thing responding that tells me the machine is alive is
> the remote PCP monitoring. It's been stuck for 5 minutes .... and
> now it is back.
> Here's what I saw:
>
> http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-wedge-1.png
>
> The livelock is at the right of the charts, where the top chart is
> all red (system CPU time), and the other charts flat line to zero.
>
> And according to fsmark:
>
>      1     66400000            0      64554.2          7705926
>      1     67200000            0      64836.1          7573013
> <hang happened here>
>      2     68000000            0      69472.8          7941399
>      2     68800000            0      85017.5          7585203
>
> it didn't record any change in performance, which means the livelock
> probably occurred between iterations. I couldn't get any info on
> what caused the livelock this time so I can only assume it has the
> same cause....
>
> Still, given the improvements in performance from this patchset,
> I'd say inclusion is a no-brainer....

In your case it's not really high memory pressure, but maybe too many
concurrent direct reclaimers, so that when one task has reclaimed some
free pages, others kick in and "steal" them.

So we need to kill the second cond_resched() call (which effectively
gives other tasks a good chance to steal this task's vmscan fruits), and
only do drain_all_pages() when nothing was reclaimed (instead of when
the allocation failed).

Dave, will you give this patch a try? It's based on Mel's.
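To make the intended control flow easier to follow, here is a rough
user-space sketch of it. This is only a toy model under my own
assumptions: struct toy_zone and every toy_*() helper are hypothetical
stand-ins for try_to_free_pages(), drain_all_pages() and
get_page_from_freelist(), and none of them exist in the kernel. It drains
the per-cpu lists once, and only when reclaim made no progress, and it
goes back to reclaim when progress was made but the allocation still
failed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy state; none of these fields exist in the kernel. */
struct toy_zone {
	int free_pages;      /* pages on the shared free list */
	int pcp_pages;       /* pages parked on per-cpu lists */
	int reclaimable;     /* pages that reclaim can still free */
	int stolen_per_pass; /* pages other tasks grab after each reclaim pass */
	int drains;          /* number of per-cpu drains performed */
	int reclaim_passes;  /* number of reclaim passes performed */
};

/* Stand-in for try_to_free_pages(): returns the number of pages freed. */
static int toy_reclaim(struct toy_zone *z, int want)
{
	int got = z->reclaimable < want ? z->reclaimable : want;

	z->reclaimable -= got;
	z->free_pages += got;
	z->reclaim_passes++;
	return got;
}

/* Stand-in for drain_all_pages(): move per-cpu pages to the free list. */
static void toy_drain(struct toy_zone *z)
{
	z->free_pages += z->pcp_pages;
	z->pcp_pages = 0;
	z->drains++;
}

/* Models concurrent reclaimers taking pages this task just freed. */
static void toy_steal(struct toy_zone *z)
{
	int n = z->stolen_per_pass < z->free_pages ?
		z->stolen_per_pass : z->free_pages;

	z->free_pages -= n;
}

/* Stand-in for get_page_from_freelist(): true if a page was obtained. */
static bool toy_alloc(struct toy_zone *z)
{
	if (z->free_pages <= 0)
		return false;
	z->free_pages--;
	return true;
}

/* Same shape as the posted diff: drain only when reclaim freed nothing. */
static bool toy_direct_reclaim(struct toy_zone *z)
{
	bool drained = false;
	int progress;

	assert(z != NULL);
retry:
	progress = toy_reclaim(z, 1);
	if (!progress) {
		if (!drained) {
			toy_drain(z);
			drained = true;
			goto retry;	/* reclaim again; no allocation attempt here */
		}
		return false;
	}

	toy_steal(z);		/* the race window with other reclaimers */
	if (toy_alloc(z))
		return true;
	goto retry;		/* progress was made but the page was stolen */
}
```

In this model, a zone with reclaimable pages and no stealing succeeds
after one reclaim pass without any drain. With stolen_per_pass set to 1,
every reclaimed page is taken by someone else, so the loop keeps
reclaiming until nothing is left, drains once, and then gives up. Note
that, as in the diff, a drain is followed by another reclaim pass rather
than by an immediate allocation attempt.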

Thanks,
Fengguang
---

--- linux-next.orig/mm/page_alloc.c	2010-09-04 11:08:03.000000000 +0800
+++ linux-next/mm/page_alloc.c	2010-09-04 11:16:33.000000000 +0800
@@ -1850,6 +1850,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
 
 	cond_resched();
 
+retry:
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
 	p->flags |= PF_MEMALLOC;
@@ -1863,26 +1864,23 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
 	lockdep_clear_current_reclaim_state();
 	p->flags &= ~PF_MEMALLOC;
 
-	cond_resched();
-
-	if (unlikely(!(*did_some_progress)))
+	if (unlikely(!(*did_some_progress))) {
+		if (!drained) {
+			drain_all_pages();
+			drained = true;
+			goto retry;
+		}
 		return NULL;
+	}
 
-retry:
 	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
 
-	/*
-	 * If an allocation failed after direct reclaim, it could be because
-	 * pages are pinned on the per-cpu lists. Drain them and try again
-	 */
-	if (!page && !drained) {
-		drain_all_pages();
-		drained = true;
+	/* someone steal our vmscan fruits? */
+	if (!page && *did_some_progress)
 		goto retry;
-	}
 
 	return page;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxxx
For more info on Linux MM, see: http://www.linux-mm.org/ .