Hi Tejun,

On Fri, Apr 20, 2012 at 12:08:44PM -0700, Tejun Heo wrote:
> Hello, Fengguang.
>
> On Fri, Apr 20, 2012 at 09:34:41PM +0800, Fengguang Wu wrote:
> > > Yup. This just shows that you have to have per-cgroup dirty limits. Once
> > > you have those, things start working again.
> >
> > Right. I think Tejun was more or less aware of this.
>
> I'm fairly sure I'm on the "less" side of it.

OK. Sorry, I should have explained why the memcg dirty limit is not the
right tool for back pressure based throttling.

To limit memcg dirty pages, two thresholds will be introduced:

0                call for flush                     dirty limit
------------------------*--------------------------------*----------------------->
                                                                 memcg dirty pages

1) when dirty pages increase to the "call for flush" point, the memcg
   will explicitly ask the flusher thread to focus more on this memcg's
   inodes

2) when the "dirty limit" is reached, the dirtier tasks will be
   throttled the hard way

When there are few memcgs, or when the safety margin between the two
thresholds is large enough, the dirty limit won't be hit and all goes
virtually as smoothly as when there are only global dirty limits.

Otherwise the memcg dirty limit will be hit occasionally, but the dirty
pages should still drop soon when the flusher thread round-robins to
this memcg.

Basically, the more memcgs with dirty limits, the harder it is for the
flusher to serve them fairly and knock down their dirty pages in time.
Because the flusher works inode by inode, each one may take up to 0.5
second, and there may be many memcgs asking for the flusher's attention.
Also, the more memcgs, the smaller the pieces the global dirty pages
pool is partitioned into, which means a smaller safety margin for each
memcg. Adding these two effects up, there may constantly be some memcgs
hitting their dirty limits when there are dozens of memcgs.

Hitting the dirty limit means all dirtier tasks, including the light
dirtiers who do occasional writes, become painfully slow. It's a bad
state that should be avoided by any means.
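To make the two-threshold scheme above concrete, here is a minimal
sketch in Python. It is an illustration only, not kernel code: the names
memcg_action() and safety_margin(), the even split of the global pool,
and the flush_fraction default of 0.5 are all assumptions made up for
this example.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"                # below both thresholds
    CALL_FOR_FLUSH = "flush"     # ask the flusher to focus on this memcg
    HARD_THROTTLE = "throttle"   # dirtier tasks are throttled the hard way


def memcg_action(dirty_pages, flush_point, dirty_limit):
    """Pick the action for a memcg given its current dirty page count."""
    if dirty_pages >= dirty_limit:
        return Action.HARD_THROTTLE
    if dirty_pages >= flush_point:
        return Action.CALL_FOR_FLUSH
    return Action.NONE


def safety_margin(global_dirty_pages, nr_memcgs, flush_fraction=0.5):
    """Pages between the two thresholds for one memcg.

    Assumes (hypothetically) that the global dirty pool is split evenly
    and that "call for flush" sits at flush_fraction of the dirty limit.
    """
    per_memcg_limit = global_dirty_pages // nr_memcgs
    flush_point = int(per_memcg_limit * flush_fraction)
    return per_memcg_limit - flush_point
```

Under these assumptions the margin shrinks in proportion to the number
of memcgs, which is the second of the two effects described above.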

Now consider the back pressure case. When the user configures two blkcgs
with 10:1 weights, the flusher will have great difficulty writing out
pages for the latter blkcg. The corresponding memcg's dirty pages rush
straight to its dirty limit, _stay_ there and can never drop to normal.
This means the latter blkcg's tasks will constantly see second-long
stalls.

The solution would be to create an adaptive threshold
blkcg.bdi.dirty_setpoint that's proportional to its buffered writeout
bandwidth and teach balance_dirty_pages() to balance dirty pages around
that target.

It avoids the worst case of hitting dirty_limit. However it may still
present big challenges to balance_dirty_pages(). For example, when
there are 10 blkcgs and 12 JBOD disks, it may create up to 10*12=120
dirty balance targets. Wow, I cannot imagine how it's going to fulfill
so many different targets.

> > I was rather upset by this per-memcg dirty_limit idea indeed. I never
> > expect it to work well when used extensively. My plan was to set the
> > default memcg dirty_limit high enough, so that it's not hit in normal.
> > Then Tejun came and proposed to (mis-)use dirty_limit as the way to
> > convert the dirty pages' backpressure into real dirty throttling rate.
> > No, that's just crazy idea!
>
> I'll tell you what's crazy.
>
> We're not gonna cut three more kernel releases and then change jobs.
> Some of the stuff we put in the kernel ends up staying there for over
> a decade. While ignoring fundamental designs and violating layers may
> look like rendering a quick solution. They tend to come back and bite
> our collective asses. Ask Vivek. The iosched / blkcg API was messed
> up to the extent that bugs were so difficult to track down and it was
> nearly impossible to add new features, let alone new blkcg policy or
> elevator and people did suffer for that for long time. I ended up
> cleaning up the mess.
> It took me longer than three months and even
> then we have to carry on with a lot of ugly stuff for compatibility.

"block/cfq-iosched.c" 3930L

Yeah it's a big pile of tricky code. In spite of that, the code
structure still looks pretty neat, kudos to all of you!

> Unfortunately, your proposed solution is far worse than blkcg was or
> ever could be. It's not even contained in a single subsystem and it's
> not even clear what it achieves.

Yeah it's cross subsystems, mainly because there are two natural
throttling points: balance_dirty_pages() and cfq. It requires both
sides to work properly.

In my proposal, balance_dirty_pages() takes care to update the weights
for async/direct IO every 200ms and store them in blkcg. cfq then grabs
the weights to update the cfq group's vdisktime.

Such cross subsystem coordination still looks natural to me because
"weight" is a fundamental and general parameter. It's really a blkcg
thing (determined by the blkio.weight user interface) rather than
specifically tied to cfq. When another kernel entity (eg. NFS or noop)
decides to add support for proportional weight IO control in future, it
can make use of the weights calculated by balance_dirty_pages(), too.

That scheme does involve non-trivial complexity in the calculations,
however IMHO it sucks much less than letting cfq take control and convey
the information all the way up to balance_dirty_pages() via
"backpressure".

When balance_dirty_pages() takes part in the job, it merely costs some
per-cpu accounting and calculations every 200ms -- both scale pretty
well.

Virtually nothing changes (in how buffered IO is performed) before/after
applying IO controllers.
From the users' perspective:

- No added latency
- No performance drop
- No bumpy progress and stalls
- No need to attach memcg to blkcg
- Feel free to create 1000+ IO controllers, to heart's content
  w/o worrying about costs (if any, it would be some existing
  scalability issues)

On the other hand, the back pressure scheme makes Linux more clumsy by
vectorizing everything from the bottom up, giving rise to a number of
problems:

- in cfq, by splitting up the global async queue, cfq suddenly sees a
  number of cfq groups full of async requests lining up and competing
  for the disk time. This could obscure things and make it harder to
  maintain low latency for sync requests.

- in cfq, it will now be switching inodes based on the 40ms async slice
  time, which defeats the flusher thread's 500ms inode slice time. The
  numbers below show the performance cost of lowering the flusher's
  slices to ~40ms:

      3.4.0-rc2             3.4.0-rc2-4M+
    -----------  ------------------------
         114.02        -4.2%       109.23  snb/thresh=8G/xfs-1dd-1-3.4.0-rc2
         102.25       -11.7%        90.24  snb/thresh=8G/xfs-10dd-1-3.4.0-rc2
         104.17       -17.5%        85.91  snb/thresh=8G/xfs-20dd-1-3.4.0-rc2
         104.94       -18.7%        85.28  snb/thresh=8G/xfs-30dd-1-3.4.0-rc2
         104.76       -21.9%        81.82  snb/thresh=8G/xfs-100dd-1-3.4.0-rc2

  We can do the optimization of increasing the cfq async time slice when
  there is no sync IO. However in general cases it could still hurt.

- in cfq, the many more async queues will be holding many more async
  requests in order to prevent queue underrun. This proportionally
  scales up the number of writeback pages, which in turn exponentially
  scales up the difficulty of reclaiming high order pages:

          P(reclaimable for THP) = P(non-PG_writeback)^512

  That means we cannot comfortably use THP in a system with more than
  0.1% writeback pages. Perhaps we need to work out some general
  optimizations to make writeback pages more concentrated in the
  physical memory space.

  Besides, when there are N seconds worth of writeback pages, it may
  take N/2 seconds on average for wait_on_page_writeback() to finish.
  So the total time cost of running into a random writeback page and
  waiting on it is also O(n^2):

          E(PG_writeback waits) = P(hit PG_writeback) * E(wait on it)

  That means we can hardly keep more than 1 second worth of writeback
  pages w/o worrying about long waits on PG_writeback in various parts
  of the kernel.

- in the flusher, we'll need to vectorize the dirty inode lists, that's
  fine. However we either need to create one flusher per blkcg, which
  has the problem of intensifying various fs lock contentions, or let
  one single flusher walk through the blkcgs, which risks more cfq
  queue underruns. We may decrease the flusher's time slice or increase
  the queue size to mitigate this, however neither looks exciting.

- balance_dirty_pages() will need to keep each blkcg's dirty pages at a
  reasonable level, otherwise there may be starvation that defeats the
  low level IO controllers and hurts IO size. Thus comes the very
  undesirable need to attach memcg to blkcg to track dirty pages.

  It's also not fun to work with dozens of dirty page targets because
  dirty pages tend to fluctuate a lot. In comparison, it's far easier
  for balance_dirty_pages() to dirty ratelimit 1000+ dd tasks in the
  global context.

In summary, the back pressure scheme looks obvious at first sight,
however there are some fundamental problems in the way. Cgroups are
expected to be *light weight* facilities. Unfortunately this scheme
will likely impose too much burden and too many side effects on the
system. It might become uncomfortable for the user to run 10+ blkcgs...
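
The two estimates in the writeback bullet above can be tried out with a
small Python sketch. It rests on assumptions that are not from the
kernel: each page is treated as independently under writeback (which is
what the ^512 formula implies), 512 is the number of 4KB pages in a 2MB
huge page, and memory_seconds is a made-up parameter expressing total
memory in seconds of writeout bandwidth.

```python
SUBPAGES_PER_THP = 512  # 4KB base pages in one 2MB huge page


def p_thp_reclaimable(writeback_fraction):
    """P(reclaimable for THP) = P(non-PG_writeback)^512."""
    return (1.0 - writeback_fraction) ** SUBPAGES_PER_THP


def expected_writeback_wait(n_seconds, memory_seconds):
    """E(PG_writeback waits) = P(hit PG_writeback) * E(wait on it).

    n_seconds:      seconds worth of pages currently under writeback
    memory_seconds: hypothetical total memory, in the same unit
    """
    p_hit = n_seconds / memory_seconds   # chance a random page is under writeback
    mean_wait = n_seconds / 2.0          # average wait, as stated above
    return p_hit * mean_wait


if __name__ == "__main__":
    for frac in (0.0001, 0.001, 0.01):
        print(frac, p_thp_reclaimable(frac))
    for n in (1, 2, 4):
        print(n, expected_writeback_wait(n, memory_seconds=100))
```

With 0.1% writeback pages the first function gives roughly 0.6, and with
1% it drops below 0.01. Doubling n_seconds quadruples the expected wait,
which is the O(n^2) behaviour mentioned above.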

> Neither weight or hard limit can be
> properly enforced without another layer of controlling at the block
> layer (some use cases do expect strict enforcement) and we're baking
> assumptions about use cases, interfaces and underlying hardware across
> multiple subsystems (some ssds work fine with per-iops switching).

cfq still has the freedom to do per-iops switching, based on the same
weight values computed by balance_dirty_pages(). cfq will need to feed
back some "IO cost" stats based on either disk time or iops, upon which
balance_dirty_pages() scales the throttling bandwidth for the dirtier
tasks by the "IO cost".

balance_dirty_pages() can also do IOPS hard limits based on the scaled
throttling bandwidth.

> For your suggested solution, the moment it's best fit is now and it'll
> be a long painful way down until someone snaps and reimplements the
> whole thing.
>
> The kernel is larger than balance_dirty_pages() or writeback. Each
> subsystem should do what it's supposed to do. Let's solve problems
> where they belong and pay overheads where they're due. Let's not
> contort the whole stack for the short term goal of shoving writeback
> support into the existing, still-developing, blkcg cfq proportional IO
> implementation. Because that's pure insanity.

To be frank I would be very pleased to avoid going into the pains of
doing all the hairy computations to graft balance_dirty_pages() onto
cfq, if ever the back pressure idea is not so upsetting.

And if there are proper ways to address its problems, it would be a
great relief for me to stop pondering on the details of disk time/IOPS
feedback and the hierarchical support (yeah I think it's somehow
possible now), and the foreseeable _numerous_ experiments to get the
ideas into shape...

Thanks,
Fengguang