On Fri 20-04-12 21:34:41, Wu Fengguang wrote:
> On Thu, Apr 19, 2012 at 10:26:35PM +0200, Jan Kara wrote:
> > > It's not uncommon for me to see filesystems sleep on PG_writeback
> > > pages during heavy writeback, within some lock or transaction, which in
> > > turn stalls many tasks that try to do IO or merely dirty some page in
> > > memory. Random writes are especially susceptible to such stalls. The
> > > stable page feature also vastly increases the chances of stalls by
> > > locking the writeback pages.
> > >
> > > Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
> > > the case of direct reclaim, it means blocking random tasks that are
> > > allocating memory in the system.
> > >
> > > PG_writeback pages are much worse than PG_dirty pages in that they are
> > > not movable. This makes a big difference for high-order page allocations.
> > > To make room for a 2MB huge page, vmscan has the option to migrate
> > > PG_dirty pages, but for PG_writeback it has no better choice than to
> > > wait for IO completion.
> > >
> > > The difficulty of THP allocation goes up *exponentially* with the
> > > number of PG_writeback pages. Assume PG_writeback pages are randomly
> > > distributed in the physical memory space. Then we have the formula
> > >
> > > P(reclaimable for THP) = 1 - P(hit PG_writeback)^256
> > Well, this implicitly assumes that PG_writeback pages are scattered
> > across memory uniformly at random. I'm not sure to which extent this is
> > true...
>
> Yeah, when describing the problem I was also thinking about the
> possibilities of optimization (it would be a very good general
> improvement). Or maybe Mel already has some solutions :)
>
> > Also as a nitpick, this isn't really an exponential growth since
> > the exponent is fixed (256 - actually it should be 512, right?). It's just
>
> Right, 512 4k pages to form one x86_64 2MB huge page.
>
> > a polynomial with a big exponent.
> > But sure, growth in the number of PG_writeback
> > pages will cause a relatively steep drop in the number of available huge
> > pages.
>
> It's exponential indeed, because "1 - p(x)" here means "p(!x)".
> It's exponential for a 10x increase in x resulting in 100x drop of y.

If 'x' is the probability that a page has PG_writeback set, then the
probability that a huge page does not contain a single PG_writeback page is
(as you almost correctly wrote): (1-x)^512. This is a polynomial by
definition: it can be expressed as $\sum_{i=0}^n a_i*x^i$ for $a_i\in R$
and $n$ finite. The expression decreases fast as x approaches 1, that's for
sure, but that does not make it exponential. Sorry, my mathematical part
could not resist this terminology correction.

> > ...
> > > > > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > > > > It's always there doing 1:1 proportional throttling. Then you try to
> > > > > kick in to add *double* throttling in block/cfq layer. Now the low
> > > > > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > > > > from its balanced state, leading to large fluctuations and program
> > > > > stalls.
> > > >
> > > > Just do the same 1:1 inside each cgroup.
> > >
> > > Sure. But the ratio mismatch I'm talking about is inter-cgroup.
> > > For example there are only 2 dd tasks doing buffered writes in the
> > > system. Now consider the mismatch that cfq is dispatching their IO
> > > requests at 10:1 weights, while balance_dirty_pages() is throttling
> > > the dd tasks at 1:1 equal split because it's not aware of the cgroup
> > > weights.
> > >
> > > What will happen in the end? The 1:1 ratio imposed by
> > > balance_dirty_pages() will take effect and the dd tasks will progress
> > > at the same pace. The cfq weights will be defeated because the async
> > > queue for the second dd (and cgroup) constantly runs empty.
> > Yup. This just shows that you have to have per-cgroup dirty limits.
> > Once you have those, things start working again.
>
> Right. I think Tejun was more or less aware of this.
>
> I was rather upset by this per-memcg dirty_limit idea indeed. I never
> expected it to work well when used extensively. My plan was to set the
> default memcg dirty_limit high enough, so that it's not hit in normal use.
> Then Tejun came and proposed to (mis-)use dirty_limit as the way to
> convert the dirty pages' backpressure into a real dirty throttling rate.
> No, that's just a crazy idea!
>
> Come on, let's not over-use memcg's dirty_limit. It's there as the
> *last resort* to keep dirty pages under control so as to maintain
> interactive performance inside the cgroup. However if used extensively
> in the system (like dozens of memcgs all hitting their dirty limits), the
> limit itself may stall random dirtiers and create interactive
> performance issues!
>
> In recent days I've come up with the idea of memcg.dirty_setpoint
> for the blkcg backpressure stuff. We can use that instead.
>
> memcg.dirty_setpoint will scale proportionally with blkcg.writeout_rate.
> Imagine bdi_setpoint. It's all the same concepts. Why do we need this?
> Because if blkcg A and B have 10:1 weights and are both doing buffered
> writes, their dirty pages should better be maintained around a 10:1
> ratio to avoid underrun and hopefully achieve better IO size.
> memcg.dirty_limit cannot guarantee that goal.

I agree that to avoid stalls of throttled processes we shouldn't be
hitting memcg.dirty_limit on a regular basis. When I wrote we need "per
cgroup dirty limits" I actually imagined something like what you describe
above: do the complete throttling computations within each memcg. That is,
estimate the throughput available for it, compute appropriate dirty rates
for its processes, and from its dirty limit estimate an appropriate
setpoint to balance around.

> But be warned! Partitioning the dirty pages always means more
> fluctuations of dirty rates (and even stalls) that are perceivable by
> the user.
> Which means another limiting factor for the backpressure
> based IO controller to scale well.

Sure, the smaller the memcg gets, the more noticeable these fluctuations
would be. I would not expect a memcg with 200 MB of memory to behave better
(and also not much worse) than a machine with that much memory...

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
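
For reference, the huge page estimate discussed earlier in this message is easy to check numerically. The following is only a rough sketch under the same assumption as in the discussion, namely that each of the 512 base pages independently has PG_writeback set with probability x. The function name and the sample values of x are made up for illustration and do not come from the kernel.

```python
def p_huge_page_clean(x: float, pages_per_huge_page: int = 512) -> float:
    """Probability that no base page in a huge-page-sized region has
    PG_writeback set, assuming pages are independent (an assumption
    taken from the discussion, not a measured property)."""
    return (1.0 - x) ** pages_per_huge_page


# Hypothetical fractions of memory under writeback, chosen for illustration.
for x in (0.0001, 0.001, 0.01):
    print(f"x = {x:<7} P(no PG_writeback page) = {p_huge_page_clean(x):.4f}")
```

The exponent stays fixed at 512 while x varies, so this is a degree-512 polynomial in x, and it still drops steeply for small x.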