On Mon, Aug 30, 2010 at 03:19:29PM +0200, Johannes Weiner wrote: > On Fri, Aug 27, 2010 at 10:24:16AM +0100, Mel Gorman wrote: > > On Fri, Aug 27, 2010 at 10:16:48AM +0200, Johannes Weiner wrote: > > > On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote: > > > > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote: > > > > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote: > > > > > > If congestion_wait() is called when there is no congestion, the caller > > > > > > will wait for the full timeout. This can cause unreasonable and > > > > > > unnecessary stalls. There are a number of potential modifications that > > > > > > could be made to wake sleepers but this patch measures how serious the > > > > > > problem is. It keeps count of how many congested BDIs there are. If > > > > > > congestion_wait() is called with no BDIs congested, the tracepoint will > > > > > > record that the wait was unnecessary. > > > > > > > > > > I am not convinced that unnecessary is the right word. On a workload > > > > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed > > > > > the VM regressing both in time and in reclaiming the right pages when > > > > > simply removing congestion_wait() from the direct reclaim paths (the > > > > > one in __alloc_pages_slowpath and the other one in > > > > > do_try_to_free_pages). > > > > > > > > > > So just being stupid and waiting for the timeout in direct reclaim > > > > > while kswapd can make progress seemed to do a better job for that > > > > > load. > > > > > > > > > > I can not exactly pinpoint the reason for that behaviour, it would be > > > > > nice if somebody had an idea. > > > > > > > > > > > > > There is a possibility that the behaviour in that case was due to flusher > > > > threads doing the writes rather than direct reclaim queueing pages for IO > > > > in an inefficient manner. So the stall is stupid but happens to work out > > > > well because flusher threads get the chance to do work. > > > > > > The workload was accessing a large sparse-file through mmap, so there > > > wasn't much IO in the first place. > > > > > > > Then waiting on congestion was the totally wrong thing to do. We were > > effectively calling sleep(HZ/10) and magically this was helping in some > > undefined manner. Do you know *which* called of congestion_wait() was > > the most important to you? > > Removing congestion_wait() in do_try_to_free_pages() definitely > worsens reclaim behaviour for this workload: > > 1. wallclock time of the testrun increases by 11% > > 2. the scanners do a worse job and go for the wrong zone: > > -pgalloc_dma 79597 > -pgalloc_dma32 134465902 > +pgalloc_dma 297089 > +pgalloc_dma32 134247237 > > -pgsteal_dma 77501 > -pgsteal_dma32 133939446 > +pgsteal_dma 294998 > +pgsteal_dma32 133722312 > > -pgscan_kswapd_dma 145897 > -pgscan_kswapd_dma32 266141381 > +pgscan_kswapd_dma 287981 > +pgscan_kswapd_dma32 186647637 > > -pgscan_direct_dma 9666 > -pgscan_direct_dma32 1758655 > +pgscan_direct_dma 302495 > +pgscan_direct_dma32 80947179 > > -pageoutrun 1768531 > -allocstall 614 > +pageoutrun 1927451 > +allocstall 8566 > > I attached the full vmstat contents below. Also the test program, > which I ran in this case as: ./mapped-file-stream 1 $((512 << 30)) > Excellent stuff. I didn't look at your vmstat output because it was for an old patch and you have already highlighted the problems related to the workload. Chances are, I'd just reach the same conclusions. What is interesting is your workload. > > > > > So personally I think it's a good idea to get an insight on the use of > > > > > congestion_wait() [patch 1] but I don't agree with changing its > > > > > behaviour just yet, or judging its usefulness solely on whether it > > > > > correctly waits for bdi congestion. > > > > > > > > > > > > > Unfortunately, I strongly suspect that some of the desktop stalls seen during > > > > IO (one of which involved no writes) were due to calling congestion_wait > > > > and waiting the full timeout where no writes are going on. > > > > > > Oh, I am in full agreement here! Removing those congestion_wait() as > > > described above showed a reduction in peak latency. The dilemma is > > > only that it increased the overall walltime of the load. > > > > > > > Do you know why because leaving in random sleeps() hardly seems to be > > the right approach? > > I am still trying to find out what's going wrong. > > > > And the scanning behaviour deteriorated, as in having increased > > > scanning pressure on other zones than the unpatched kernel did. > > > > > > > Probably because it was scanning more but not finding what it needed. > > There is a condition other than congestion it is having trouble with. In > > some respects, I think if we change congestion_wait() as I propose, > > we may see a case where CPU usage is higher because it's now > > encountering the unspecified reclaim problem we have. > > Exactly. > > > > So I think very much that we need a fix. congestion_wait() causes > > > stalls and relying on random sleeps for the current reclaim behaviour > > > can not be the solution, at all. > > > > > > I just don't think we can remove it based on the argument that it > > > doesn't do what it is supposed to do, when it does other things right > > > that it is not supposed to do ;-) > > > > > > > We are not removing it, we are just stopping it going to sleep for > > stupid reasons. If we find that wall time is increasing as a result, we > > have a path to figuring out what the real underlying problem is instead > > of sweeping it under the rug. > > Well, for that testcase it is in effect the same as a removal as > there's never congestion. > > But again: I agree with your changes per-se, I just don't think they > should get merged as long as they knowingly catalyze a problem that > has yet to be identified. Ok, well there was some significant feedback on why wholesale changing of congestion_wait() reached too far and I've incorporated that feedback. I also integrated your workload into my testsuite (btw, because there is no license the script has to download it from a google archive. I might get back to you on licensing this so it can be made a permanent part of the suite). These are the results just for your workload on the only machine I had available with a lot of disk. There are a bunch of kernels because I'm testing a superset of different series posted recently. The nocongest column is an unreleased patch that has congestion_wait() and wait_iff_congested() that only goes to sleep if there is real congestion or a lot of writeback going on. Rather than worrying about the patch contents for now, lets consider the results for just your workload. The report is in 4 parts. The first is the vmstat counter differences as a result of running your test. The exact interpretation of good and bad here is open to interpretation. The second part is based on the vmscan tracepoints. The third part is based on the congestion tracepoints and the final part reports CPU usage and elapsed time. MICRO traceonly-v1r4 nocongest-v1r4 lowlumpy-v1r4 nodirect-v1r4 pgalloc_dma 89409.00 ( 0.00%) 47750.00 ( -87.24%) 47430.00 ( -88.51%) 47246.00 ( -89.24%) pgalloc_dma32 101407571.00 ( 0.00%) 101518722.00 ( 0.11%) 101502059.00 ( 0.09%) 101511868.00 ( 0.10%) pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pgsteal_dma 74529.00 ( 0.00%) 43386.00 ( -71.78%) 43213.00 ( -72.47%) 42691.00 ( -74.58%) pgsteal_dma32 100666955.00 ( 0.00%) 100712596.00 ( 0.05%) 100712537.00 ( 0.05%) 100713305.00 ( 0.05%) pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pgscan_kswapd_dma 118198.00 ( 0.00%) 47370.00 (-149.52%) 49515.00 (-138.71%) 46134.00 (-156.21%) pgscan_kswapd_dma32 177619794.00 ( 0.00%) 161549938.00 ( -9.95%) 161679701.00 ( -9.86%) 156657926.00 ( -13.38%) pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pgscan_direct_dma 27128.00 ( 0.00%) 39215.00 ( 30.82%) 36561.00 ( 25.80%) 38806.00 ( 30.09%) pgscan_direct_dma32 23927492.00 ( 0.00%) 40122173.00 ( 40.36%) 39997463.00 ( 40.18%) 45041626.00 ( 46.88%) pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) pageoutrun 756020.00 ( 0.00%) 903192.00 ( 16.29%) 899965.00 ( 15.99%) 868055.00 ( 12.91%) allocstall 2722.00 ( 0.00%) 70156.00 ( 96.12%) 67554.00 ( 95.97%) 87691.00 ( 96.90%) So, the allocstall counts go up of course because it is incremented every time direct reclaim is entered and nocongest is only going to sleep when there is congestion or significant writeback. I don't see this as being nceessarily bad. Direct scanning rates go up a bit as you'd expect - again because we are sleeping less. It's interesting that the pages reclaimed is reduced implying that despite higher scanning rates, there is less reclaim activity. It's debatable if this is good or not because higher scanning rates in themselves are not bad but fewer pages reclaimed seems positive so lets see what the rest of the reports look like. FTrace Reclaim Statistics: vmscan micro-traceonly-v1r4-micromicro-nocongest-v1r4-micromicro-lowlumpy-v1r4-micromicro-nodirect-v1r4-micro traceonly-v1r4 nocongest-v1r4 lowlumpy-v1r4 nodirect-v1r4 Direct reclaims 2722 70156 67554 87691 Direct reclaim pages scanned 23955333 40161426 40034132 45080524 Direct reclaim write file async I/O 0 0 0 0 Direct reclaim write anon async I/O 0 0 0 0 Direct reclaim write file sync I/O 0 0 0 0 Direct reclaim write anon sync I/O 0 0 0 0 Wake kswapd requests 2718040 17801688 17622777 18379572 Kswapd wakeups 24 1 1 1 Kswapd pages scanned 177738381 161597313 161729224 156704078 Kswapd reclaim write file async I/O 0 0 0 0 Kswapd reclaim write anon async I/O 0 0 0 0 Kswapd reclaim write file sync I/O 0 0 0 0 Kswapd reclaim write anon sync I/O 0 0 0 0 Time stalled direct reclaim (seconds) 247.97 76.97 77.15 76.63 Time kswapd awake (seconds) 489.17 400.20 403.19 390.08 Total pages scanned 201693714 201758739 201763356 201784602 %age total pages scanned/written 0.00% 0.00% 0.00% 0.00% %age file pages scanned/written 0.00% 0.00% 0.00% 0.00% Percentage Time Spent Direct Reclaim 41.41% 16.03% 15.96% 16.32% Percentage Time kswapd Awake 98.76% 98.94% 98.96% 98.87% Interesting, kswapd is now staying awake (woke up only once) even though the total time awake was reduced and it looks like because it was requested to wake up a lot more that was keeping it awake. Despite the higher scan rates from direct reclaim, the time actually spent direct reclaiming is significantly reduced. Scanning rates and times we direct reclaim go up but as we finish work a lot faster, it would seem that we are doing less work overall. FTrace Reclaim Statistics: congestion_wait Direct number congest waited 3664 0 0 0 Direct time congest waited 247636ms 0ms 0ms 0ms Direct full congest waited 3081 0 0 0 Direct number conditional waited 0 47587 45659 58779 Direct time conditional waited 0ms 0ms 0ms 0ms Direct full conditional waited 3081 0 0 0 KSwapd number congest waited 1448 949 909 981 KSwapd time congest waited 118552ms 31652ms 32780ms 38732ms KSwapd full congest waited 1056 90 115 147 KSwapd number conditional waited 0 0 0 0 KSwapd time conditional waited 0ms 0ms 0ms 0ms KSwapd full conditional waited 1056 90 115 147 congest waited is congestion_wait() and conditional waited is wait_iff_congested(). Look at what happens to the congest waited times for direct reclaim - it disappeared and despite the number of times wait_iff_congested() was called, it never actually decided it needed to sleep. kswapd is still congestion waiting but the time it spent is reduced. MMTests Statistics: duration User/Sys Time Running Test (seconds) 350.9 403.27 406.12 393.02 Total Elapsed Time (seconds) 495.29 404.47 407.44 394.53 This is plain old time. The same test completes 91 seconds faster. Ordinarily at this point I would be preparing to do a full series report including the other benchmarks but I'm interested in seeing if there is a significantly different reading of the above report as to whether it is a "good" or "bad" result? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html