On Tue 29-05-12 15:51:01, Michal Hocko wrote:
[...]
> OK, I have tried it with a simpler approach:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c978ce4..e45cf2a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1294,8 +1294,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	 * isolated page is PageWriteback
>  	 */
>  	if (nr_writeback && nr_writeback >=
> -			(nr_taken >> (DEF_PRIORITY - sc->priority)))
> -		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +			(nr_taken >> (DEF_PRIORITY - sc->priority))) {
> +		if (global_reclaim(sc))
> +			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +		else
> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +	}
>
>  	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
>  		zone_idx(zone),
> [...]
> As a conclusion congestion wait performs better (even though I haven't
> done repeated testing to see what is the deviation) when the
> reader/writer size doesn't fit into the memcg, while it performs much
> worse (at least for writer) if it does fit.
>
> I will play with that some more

I have, yet again, updated the test. I am writing data to a USB stick
(with ext3, mounted in sync mode) which writes 1G in 274.518s
(3.8MB/s), so the storage is really slow. The parallel read is performed
from tmpfs and from a local ext3 partition (testing script is attached).
We start with writing so the LRUs will have some dirty pages when the
read starts and fills up the LRU with clean page cache.
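
For readers who want to see how the throttling condition in the quoted
hunk behaves, here is a minimal userspace model in Python. It is only a
sketch: should_throttle() and wait_kind() are hypothetical helper names,
DEF_PRIORITY is taken to be 12 as in the kernel headers, and the page
counts are made up for illustration. It does not sleep or touch any
kernel state.

```python
# Userspace model of the check in shrink_inactive_list(); illustration only.
DEF_PRIORITY = 12  # assumed to match the kernel's DEF_PRIORITY


def should_throttle(nr_writeback, nr_taken, priority):
    """Mirror: nr_writeback && nr_writeback >= (nr_taken >> (DEF_PRIORITY - priority))."""
    threshold = nr_taken >> (DEF_PRIORITY - priority)
    return nr_writeback > 0 and nr_writeback >= threshold


def wait_kind(is_global_reclaim):
    """Name the wait the patched hunk would pick (hypothetical helper)."""
    return "wait_iff_congested" if is_global_reclaim else "congestion_wait"


if __name__ == "__main__":
    nr_taken = 32  # made-up batch size for the example
    for priority in range(DEF_PRIORITY, DEF_PRIORITY - 4, -1):
        threshold = nr_taken >> (DEF_PRIORITY - priority)
        print(f"priority={priority:2d} threshold={threshold:2d} "
              f"throttle(16 under writeback)={should_throttle(16, nr_taken, priority)}")
    print("memcg reclaim uses:", wait_kind(False))
```

The threshold halves each time the priority drops by one, so the lower
the priority gets, the smaller the share of isolated pages under
writeback that is enough to trigger the wait.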

congestion wait:
================
* ext3 (reader)
                                              avg       std/avg
** Write
5M    412.128   334.944   337.708   339.457   356.0593  [10.51%]
60M   566.652   321.607   492.025   317.942   424.5565  [29.39%]
300M  318.437   315.321   319.515   314.981   317.0635  [0.71%]
2G    317.777   314.8     318.657   319.409   317.6608  [0.64%]
** Read
5M    40.1829   40.8907   48.8362   40.0535   42.4908   [9.99%]
60M   15.4104   16.1693   18.9162   16.0049   16.6252   [9.39%]
300M  17.0376   15.6721   15.6137   15.756    16.0199   [4.25%]
2G    15.3718   17.3714   15.3873   15.4554   15.8965   [6.19%]

* Tmpfs (reader)
** Write
5M    324.425   327.395   573.688   314.884   385.0980  [32.68%]
60M   464.578   317.084   375.191   318.947   368.9500  [18.76%]
300M  316.885   323.759   317.212   318.149   319.0013  [1.01%]
2G    317.276   318.148   318.97    316.897   317.8228  [0.29%]
** Read
5M    0.9241    0.8620    0.9391    1.2922    1.0044    [19.39%]
60M   0.8753    0.8359    1.0072    1.3317    1.0125    [22.23%]
300M  0.9825    0.8143    0.9864    0.8692    0.9131    [9.35%]
2G    0.9990    0.8281    1.0312    0.9034    0.9404    [9.83%]

PageReclaim:
=============
* ext3 (reader)
                                              avg       std/avg   comparison (cong is 100%)
** Write
5M    313.08    319.924   325.206   325.149   320.8398  [1.79%]   90.11%
60M   314.135   415.245   502.157   313.776   386.3283  [23.50%]  91.00%
300M  313.718   320.448   315.663   316.714   316.6358  [0.89%]   99.87%
2G    317.591   316.67    316.285   316.624   316.7925  [0.18%]   99.73%
** Read
5M    19.0228   20.6743   17.2508   17.5946   18.6356   [8.37%]   43.86%
60M   17.3657   15.6402   16.5168   15.5601   16.2707   [5.22%]   97.87%
300M  17.1986   15.7616   19.5163   16.9544   17.3577   [9.05%]   108.35%
2G    15.6696   15.5384   15.4381   15.2454   15.4729   [1.16%]   97.34%

* Tmpfs (reader)
** Write
5M    317.303   314.366   316.508   318.883   316.7650  [0.59%]   82.26%
60M   579.952   666.606   660.021   655.346   640.4813  [6.34%]   173.60%
300M  318.494   318.64    319.516   316.79    318.3600  [0.36%]   99.80%
2G    315.935   318.069   321.097   320.329   318.8575  [0.73%]   100.33%
** Read
5M    0.8415    0.8550    0.7892    0.8515    0.8343    [3.67%]   83.07%
60M   0.8536    0.8685    0.8237    0.8805    0.8565    [2.86%]   84.60%
300M  0.8309    0.8724    0.8553    0.8577    0.8541    [2.01%]   93.53%
2G    0.8427    0.8468    0.8325    1.4658    0.9970    [31.36%]  106.01%
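
The avg, std/avg and comparison columns above can be recomputed from the
four raw runs with a few lines of Python. This is a sketch, not the
script that produced the tables: summarize() and relative() are
hypothetical helper names, and the use of the sample standard deviation
(n-1 in the denominator) is inferred from the figures, since that is what
appears to reproduce the std/avg column for the rows checked here.

```python
from statistics import mean, stdev


def summarize(runs):
    """Return (avg, std/avg in percent) for one row of raw runs.

    stdev() is the sample standard deviation (n-1), an assumption that
    seems to agree with the std/avg column in the tables.
    """
    avg = mean(runs)
    return avg, 100.0 * stdev(runs) / avg


def relative(avg, baseline_avg):
    """Express avg as a percentage of the congestion wait average."""
    return 100.0 * avg / baseline_avg


# Raw runs copied from the 5M "ext3 (reader)" Write rows.
cong_wait_5m = [412.128, 334.944, 337.708, 339.457]
page_reclaim_5m = [313.08, 319.924, 325.206, 325.149]

cong_avg, cong_cv = summarize(cong_wait_5m)
pr_avg, pr_cv = summarize(page_reclaim_5m)
comparison = relative(pr_avg, cong_avg)

if __name__ == "__main__":
    print(f"congestion wait: avg={cong_avg:.4f} std/avg={cong_cv:.2f}%")
    print(f"PageReclaim:     avg={pr_avg:.4f} std/avg={pr_cv:.2f}% "
          f"comparison={comparison:.2f}%")
```

With only four runs per row a single outlier moves std/avg a lot, so
rows with a large std/avg are better read as "needs more runs" than as a
firm result.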

Variance (std/avg) seems to be lower for both reads and writes with the
PageReclaim approach, and if we compare the average numbers it also
seems to be mostly better (especially for reads) or within the noise.
There are two "peaks" in the numbers, though.
* 60M cgroup write performance when reading from tmpfs. While the read
  behaved well with the PageReclaim patch (actually much better than
  congwait), the writer stalled a lot.
* 5M cgroup read performance when reading from ext3, where the
  congestion_wait approach fell flat while PageReclaim did better for
  both read and write.

So I guess the conclusion could be that the two approaches are
comparable. Both of them can lead to stalling, but they mostly do well,
which is much better than an OOM killer. We can do much better, but that
would require conditional sleeping.

How do people feel about going with the simpler approach for now (even
for stable kernels, as the problem is real and long term) and working on
the conditional part as a follow-up?
Which way would be preferable? I can post a full patch for the
congestion wait approach if you are interested. I do not care much, as
both of them fix the problem.
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

Attachment: cgroup_cache_oom_test.sh
Description: Bourne shell script