Hi Dave,

On Wed, Mar 23, 2011 at 12:41:52PM +0800, Dave Chinner wrote:
> On Tue, Mar 22, 2011 at 10:43:14PM +0100, Jan Kara wrote:
> > Hello Fengguang,
> >
> > On Fri 18-03-11 22:30:01, Wu Fengguang wrote:
> > > On Wed, Mar 09, 2011 at 06:31:10AM +0800, Jan Kara wrote:
> > > >
> > > > Hello,
> > > >
> > > > I'm posting second version of my IO-less balance_dirty_pages()
> > > > patches. This is alternative approach to Fengguang's patches -
> > > > much simpler I believe (only 300 lines added) - but obviously I
> > > > does not provide so sophisticated control.
> > >
> > > Well, it may be too early to claim "simplicity" as an advantage,
> > > until you achieve the following performance/feature comparability
> > > (most of them are not optional ones). AFAICS this work is kind of
> > > heavy lifting that will consume a lot of time and attention. You'd
> > > better find some more fundamental needs before go on the reworking.

To start with, let me explain the items below so that you can judge
whether they are desirable, both now and in the future.

> > > (1) latency

The potential impacts of a write(2) syscall being blocked for over 200ms
are:

- user perceptible unresponsiveness
- an interrupted IO pipeline in applications of this kind:

	loop {
		read(buf)
		write(buf)
	}

In particular, the read() IO stream will be interrupted. If it is a read
stream from another disk, that disk may go idle for the period. If it is
a read stream from the network, a client uploading files via an ADSL link
will find its upload bandwidth under-utilized.

> > > (2) fairness

This was raised by Jan Kara. It should be application dependent: some
workloads may not care at all, some may be more sensitive.

> > > (3) smoothness

This reflects the variation of a task's throttled dirty bandwidth.
Fluctuations in the dirty rate can degrade a read-write IO pipeline in a
similar way to (1).

> > > (4) scalability

Being able to scale well to 1000+ dirtiers, and to large numbers of CPUs
and disks.

> > > (5) per-task IO controller
> > > (6) per-cgroup IO controller (TBD)
> > > (7) free combinations of per-task/per-cgroup and bandwidth/priority
> > >     controllers
> > > (8) think time compensation

These are the IO controller items. It seems the only disagreements are on
when and how to provide the features. (5) and the priority based IO
controller are side products of the base bandwidth solution.

> > > (9) backed by both theory and tests

The prerequisites for quality code.

> > > (10) adapt pause time up on 100+ dirtiers

To reduce CPU overheads, requested by you :)

> > > (11) adapt pause time down on low dirty pages

This is necessary to avoid < 100% disk utilization on small memory
systems.

> > > (12) adapt to new dirty threshold/goal

This is to support dynamically lowering the number of dirty pages in
order to avoid excessive page_out() during page reclaim.

> > > (13) safeguard against dirty exceeding

This is a must-have to prevent DoS situations. There has to be a hard
dirty limit that forces the dirtiers to loop inside
balance_dirty_pages(), because the regular throttling scheme in both
Jan's patches and mine can at most throttle a dirtier at, e.g., 200ms per
page. That is not enough to stop the dirty pages from growing high when
there are many dirtiers writing to a slow USB stick (see the sketch after
this list).

> > > (14) safeguard against device queue underflow (in JBOD case,
> > >      maintain reasonable number of dirty pages in each disk queue)

This is necessary to avoid < 100% disk utilization on small memory
systems.

Note that (14) and (11) share the same goal, but need independent
safeguards in two different places.
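To make (13) concrete, here is a toy calculation. The numbers are pure
assumptions (1000 dirtiers, a ~1MB/s USB stick, 4KB pages), not
measurements; it only shows why a bounded per-page pause alone cannot
keep the dirty page count from growing:

/* Toy numbers, not measurements: shows why a bounded per-page pause
 * alone cannot cap the number of dirty pages. */
#include <stdio.h>

int main(void)
{
    double max_pause = 0.2;    /* 200ms max pause per dirtied page */
    int ndirtiers = 1000;      /* assumed number of heavy dirtiers */
    double dev_bw = 1.0e6;     /* assumed slow USB stick: ~1MB/s */
    double page_sz = 4096.0;

    /* Each dirtier can dirty at most one page per max_pause, so the
     * aggregate dirty rate the throttle still allows is: */
    double dirty_rate = ndirtiers / max_pause;   /* pages/s */
    double writeout_rate = dev_bw / page_sz;     /* pages/s */

    printf("allowed dirty rate: %.0f pages/s\n", dirty_rate);
    printf("device writeout:    %.0f pages/s\n", writeout_rate);
    printf("net growth:         %.0f pages/s => needs a hard limit\n",
           dirty_rate - writeout_rate);
    return 0;
}

With those assumed numbers the dirtiers can still push ~5000 pages/s
while the device drains only ~244 pages/s, hence the need for a hard
limit that makes them loop until the dirty count drops.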
> > I think this is a misunderstanding of my goals ;). My main goal is to
> > explore, how far we can get with a relatively simple approach to IO-less
> > balance_dirty_pages(). I guess what I have is better than the current
> > balance_dirty_pages() but it sure does not even try to provide all the
> > features you try to provide.
>
> This is my major concern - maintainability of the code. It's all
> well and good to evaluate the code based on it's current
> performance, but what about 2 or 3 years down the track when for
> some reason it's not working like it was intended - just like what
> happened with slow degradation in writeback performance between
> ~2.6.15 and ~2.6.30.
>
> Fundamentally, the _only_ thing I want balance_dirty_pages() to do
> is _not issue IO_. Issuing IO in balance_dirty_pages() simply does
> not scale, especially for devices that have no inherent concurrency.
> I don't care if the solution is not perfectly fair or that there is
> some latency jitter between threads, I just want to avoid having the
> IO issue patterns change drastically when the system runs out of
> clean pages.
>
> IMO, that's all we should be trying to acheive with IO-less write
> throttling right now. Get that algorithm and infrastructure right
> first, then we can work out how to build on that to do more fancy
> stuff.

You already have what you want in the base bandwidth patches v6 :) They
have been tested extensively and are ready for wider -mm exercise.

Plus more for others that actually care. As long as the implemented
features/performance are

- desirable in the long term
- not achievable in a fundamentally simpler/more robust way

then I see no point in dropping the ready-to-work solution and taking so
much effort to breed another one, perhaps spending 1-2 more release
cycles only to reach the same level of performance _and_ complexity.

> > I'm thinking about tweaking ratelimiting logic to reduce latencies in
> > some tests, possibly add compensation when we waited for too long in
> > balance_dirty_pages() (e.g. because of bumpy IO completion) but that's
> > about it...
> >
> > Basically I do this so that we can compare and decide whether what my
> > simple approach offers is OK or whether we want some more complex
> > solution like your patches...
>
> I agree completely.
>
> FWIW (and that may not be much), the IO-less write throttling that I
> wrote for Irix back in 2004 was very simple and very effective -
> input and output bandwidth estimation updated once per second, with
> a variable write syscall delay applied on each syscall also
> calculated once per second. The change to the delay was based on the
> difference between input and output rates and the number of write
> syscalls per second.
>
> I tried all sorts of fancy stuff to improve it, but the corner cases
> in anything fancy led to substantial complexity of algorithms and
> code and workloads that just didn't work well. In the end, simple
> worked better than fancy and complex and was easier to understand,
> predict and tune....
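If I read that right, a once-per-second scheme along those lines would
look roughly like the sketch below. The names and the exact update rule
are my guesses, not the Irix code; the idea is just to nudge the
per-write(2) delay by the input/output rate mismatch, spread over the
observed syscall rate:

#include <stdio.h>

/* Once-per-second update: adjust the per-write(2) delay based on how
 * much faster data is dirtied than written back, spread over the
 * number of write syscalls seen in that second. (A guess at the shape
 * of the scheme, not the actual Irix code.) */
static double update_delay(double delay, double in_bps, double out_bps,
                           double writes_per_sec)
{
    double excess = in_bps - out_bps;   /* surplus bytes dirtied per second */

    if (writes_per_sec > 0 && out_bps > 0)
        delay += (excess / out_bps) / writes_per_sec;
    return delay > 0 ? delay : 0;
}

int main(void)
{
    /* toy numbers: dirtying at 200MB/s, device draining 100MB/s,
     * 1000 write() calls per second */
    double delay = update_delay(0.0, 2.0e8, 1.0e8, 1000.0);

    printf("per-write delay after one interval: %.1f ms\n", delay * 1000);
    return 0;
}

In practice the input rate drops as the delay takes effect, which is
what closes the feedback loop.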
I may be one of the biggest sufferers of the inherent writeback
complexities, and would be more than pleased to have some simple scheme
that deals with only simple requirements. However, I feel obliged to
treat smoothness as a requirement when everybody is complaining about
writeback responsiveness issues in bugzilla, on the mailing lists, and at
LSF and the kernel summit.

To help debug and evaluate the work, a bunch of test suites have been
created and their results visualized. I can tell from the graphs that the
v6 patches perform very well in all the test cases. That is achieved with
880 lines of code in page-writeback.c. It's good that Jan's v2 does it in
merely 300 lines of code; however, it is still missing the code for
(10)-(14), and the performance gaps are simply too large in many of the
cases. IMHO it would need some non-trivial revisions to become a real
candidate.

Thanks,
Fengguang