On Sat, Sep 04, 2010 at 12:25:45PM +1000, Dave Chinner wrote:
> On Fri, Sep 03, 2010 at 04:00:26PM -0700, Andrew Morton wrote:
> > On Fri, 3 Sep 2010 10:08:46 +0100
> > Mel Gorman <mel@xxxxxxxxx> wrote:
> >
> > > When under significant memory pressure, a process enters direct reclaim
> > > and immediately afterwards tries to allocate a page. If it fails and no
> > > further progress is made, it's possible the system will go OOM. However,
> > > on systems with large amounts of memory, it's possible that a significant
> > > number of pages are on per-cpu lists and inaccessible to the calling
> > > process. This leads to a process entering direct reclaim more often
> > > than it should, increasing the pressure on the system and compounding
> > > the problem.
> > >
> > > This patch notes that if direct reclaim is making progress but
> > > allocations are still failing, the system is already under heavy
> > > pressure. In this case, it drains the per-cpu lists and tries the
> > > allocation a second time before continuing.
> ....
> > The patch looks reasonable.
> >
> > But please take a look at the recent thread "mm: minute-long livelocks
> > in memory reclaim". There, people are pointing fingers at that
> > drain_all_pages() call, suspecting that it's causing huge IPI storms.
> >
> > Dave was going to test this theory but afaik hasn't yet done so. It
> > would be nice to tie these threads together if poss?
>
> It's been my "next-thing-to-do" since David suggested I try it -
> tracking down other problems has got in the way, though. I
> just ran my test a couple of times through:
>
> $ ./fs_mark -D 10000 -L 63 -S0 -n 100000 -s 0 \
>         -d /mnt/scratch/0 -d /mnt/scratch/1 \
>         -d /mnt/scratch/3 -d /mnt/scratch/2 \
>         -d /mnt/scratch/4 -d /mnt/scratch/5 \
>         -d /mnt/scratch/6 -d /mnt/scratch/7
>
> to create millions of inodes in parallel on an 8p/4G RAM VM.
> The filesystem is ~1.1TB XFS:
>
> # mkfs.xfs -f -d agcount=16 /dev/vdb
> meta-data=/dev/vdb               isize=256    agcount=16, agsize=16777216 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=268435456, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal log           bsize=4096   blocks=131072, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> # mount -o inode64,delaylog,logbsize=262144,nobarrier /dev/vdb /mnt/scratch

Unfortunately, I doubt I'll be able to reproduce this test. I don't
have access to a machine with enough processors or disk. I will try
on 4p/4G and 500M and see how that pans out.

> Performance prior to this patch was that each iteration resulted in
> ~65k files/s, with occasional peaks to 90k files/s, but frequent
> drops to 45k files/s when reclaim ran to reclaim the inode caches.
> This load ran permanently at 800% CPU usage.
>
> Every so often (maybe once or twice per 50M inode create run) all 8
> CPUs would remain pegged but the create rate would drop to zero for
> a few seconds to a couple of minutes. That was the livelock issue I
> reported.

Should be easy to spot at least.

> With this patchset, I'm seeing a per-iteration average of ~77k
> files/s, with only a couple of iterations dropping down to ~55k
> files/s and a significant number above 90k/s. The runtime to 50M
> inodes is down by ~30% and the average CPU usage across the run is
> around 700%. IOWs, there is a significant gain in performance and a
> significant drop in CPU usage. I've done two runs to 50M inodes,
> and not seen any sign of a livelock, even for short periods of time.
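For reference, the shape of the change being tested here - drain the
per-cpu lists and retry the allocation once when it fails despite
direct reclaim making progress - looks roughly like the sketch below.
It is a minimal sketch modelled on the 2.6.36-era allocator slow path
in mm/page_alloc.c; the exact signature and helper names should be
treated as illustrative rather than the literal diff.

/*
 * Sketch of the direct reclaim slow path with the drain-and-retry
 * logic applied. Modelled on 2.6.36-era mm/page_alloc.c; details
 * are approximate.
 */
static struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
		struct zonelist *zonelist, enum zone_type high_zoneidx,
		nodemask_t *nodemask, int alloc_flags,
		struct zone *preferred_zone, int migratetype,
		unsigned long *did_some_progress)
{
	struct page *page = NULL;
	bool drained = false;

	*did_some_progress = __perform_reclaim(gfp_mask, order,
					       zonelist, nodemask);
	if (unlikely(!*did_some_progress))
		return NULL;

retry:
	page = get_page_from_freelist(gfp_mask, nodemask, order,
			zonelist, high_zoneidx, alloc_flags,
			preferred_zone, migratetype);

	/*
	 * Reclaim made progress but the allocation still failed: free
	 * pages may be stranded on per-cpu lists where this task cannot
	 * reach them. Drain them back to the buddy lists and retry
	 * exactly once before falling through.
	 */
	if (!page && !drained) {
		drain_all_pages();
		drained = true;
		goto retry;
	}

	return page;
}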
Very cool.

> Ah, spoke too soon - I let the second run keep going, and at ~68M
> inodes it's just pegged all the CPUs and is pretty much completely
> wedged. Serial console is not responding, I can't get a new login,
> and the only thing responding that tells me the machine is alive is
> the remote PCP monitoring. It's been stuck for 5 minutes .... and
> now it is back. Here's what I saw:
>
> http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-wedge-1.png
>
> The livelock is at the right of the charts, where the top chart is
> all red (system CPU time) and the other charts flatline to zero.
>
> And according to fsmark:
>
>      1     66400000            0      64554.2          7705926
>      1     67200000            0      64836.1          7573013
>      <hang happened here>
>      2     68000000            0      69472.8          7941399
>      2     68800000            0      85017.5          7585203
>
> it didn't record any change in performance, which means the livelock
> probably occurred between iterations. I couldn't get any info on
> what caused the livelock this time, so I can only assume it has the
> same cause....

Not sure where you could have gotten stuck. I thought it might have
locked up in congestion_wait(), but it wouldn't have locked up this
badly if that was the case. Sluggish sure, but not that dead. I'll
see about reproducing with your test tomorrow and see what I find.
Thanks.

> Still, given the improvements in performance from this patchset,
> I'd say inclusion is a no-brainer....

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
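As background to the IPI-storm concern Andrew raises above: in kernels
of this era, drain_all_pages() performed the drain on every online CPU
via on_each_cpu(), which interrupts each remote CPU. The following is
a simplified sketch of that mechanism, loosely modelled on
mm/page_alloc.c of the period; treat the details as approximate.

/*
 * Drain the per-cpu page lists of the CPU this runs on, returning
 * the pages to the buddy allocator's free lists.
 */
static void drain_local_pages(void *arg)
{
	drain_pages(smp_processor_id());
}

void drain_all_pages(void)
{
	/*
	 * on_each_cpu() runs the callback locally and sends an IPI to
	 * every other online CPU, waiting for all of them to finish.
	 * Invoked frequently from the allocator slow path on a large
	 * machine, this is where an "IPI storm" can come from.
	 */
	on_each_cpu(drain_local_pages, NULL, 1);
}

This is also why the patch under discussion only drains after direct
reclaim has made progress and the allocation has still failed, so the
drain happens at most once per pass through the slow path.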