On Tue 05-11-13 15:12:45, Dave Chinner wrote: > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote: > > Something simple like “start writing at 16MB dirty on a single file” > > would probably avoid a lot of complexity at little real-world cost. > > That shouldn’t throttle dirtying memory above 16MB, but just start > > writeout much earlier than it does today. > > That doesn't solve the "slow device, large file" problem. We can > write data into the page cache at rates of over a GB/s, so it's > irrelevant to a device that can write at 5MB/s whether we start > writeback immediately or a second later when there is 500MB of dirty > pages in memory. AFAIK, the only way to avoid that problem is to > use write-through caching for such devices - where they throttle to > the IO rate at very low levels of cached data. Agreed. > Realistically, there is no "one right answer" for all combinations > of applications, filesystems and hardware, but writeback caching is > the best *general solution* we've got right now. > > However, IMO users should not need to care about tuning BDI dirty > ratios or even have to understand what a BDI dirty ratio is to > select the rigth caching method for their devices and/or workload. > The difference between writeback and write through caching is easy > to explain and AFAICT those two modes suffice to solve the problems > being discussed here. Further, if two modes suffice to solve the > problems, then we should be able to easily define a trigger to > automatically switch modes. > > /me notes that if we look at random vs sequential IO and the impact > that has on writeback duration, then it's very similar to suddenly > having a very slow device. IOWs, fadvise(RANDOM) could be used to > switch an *inode* to write through mode rather than writeback mode > to solve the problem aggregating massive amounts of random write IO > in the page cache... I disagree here. Writeback cache is also useful for aggregating random writes and making semi-sequential writes out of them. There are quite some applications which rely on the fact that they can write a file in a rather random manner (Berkeley DB, linker, ...) but the files are written out in one large linear sweep. That is actually the reason why SLES (and I believe RHEL as well) tune dirty_limit even higher than what's the default value. So I think it's rather the other way around: If you can detect the file is being written in a streaming manner, there's not much point in caching too much data for it. And I agree with you that we also have to be careful not to cache too few because otherwise two streaming writes would be interleaved too much. Currently, we have writeback_chunk_size() which determines how much we ask to write from a single inode. So streaming writers are going to be interleaved at this chunk size anyway (currently that number is "measured bandwidth / 2"). So it would make sense to also limit amount of dirty cache for each file with streaming pattern at this number. > So rather than treating this as a "one size fits all" type of > problem, let's step back and: > > a) define 2-3 different caching behaviours we consider > optimal for the majority of workloads/hardware we care > about. > b) determine optimal workloads for each caching > behaviour. > c) develop reliable triggers to detect when we > should switch between caching behaviours. > > e.g: > > a) write back caching > - what we have now > write through caching > - extremely low dirty threshold before writeback > starts, enough to optimise for, say, stripe width > of the underlying storage. > > b) write back caching: > - general purpose workload > write through caching: > - slow device, write large file, sync > - extremely high bandwidth devices, multi-stream > sequential IO > - random IO. > > c) write back caching: > - default > - fadvise(NORMAL, SEQUENTIAL, WILLNEED) > write through caching: > - fadvise(NOREUSE, DONTNEED, RANDOM) > - random IO > - sequential IO, BDI write bandwidth <<< dirty threshold > - sequential IO, BDI write bandwidth >>> dirty threshold > > I think that covers most of the issues and use cases that have been > discussed in this thread. IMO, this is the level at which we need to > solve the problem (i.e. architectural), not at the level of "let's > add sysfs variables so we can tweak bdi ratios". > > Indeed, the above implies that we need the caching behaviour to be a > property of the address space, not just a property of the backing > device. Yes, and that would be interesting to implement and not make a mess out of the whole writeback logic because the way we currently do writeback is inherently BDI based. When we introduce some special per-inode limits, flusher threads would have to pick more carefully what to write and what not. We might be forced to go that way eventually anyway because of memcg aware writeback but it's not a simple step. > IOWs, the implementation needs to trickle down from a coherent high > level design - that will define the knobs that we need to expose to > userspace. We should not be adding new writeback behaviours by > adding knobs to sysfs without first having some clue about whether > we are solving the right problem and solving it in a sane manner... Agreed. But the ability to limit amount of dirty pages outstanding against a particular BDI seems as a sane one to me. It's not as flexible and automatic as the approach you suggested but it's much simpler and solves most of problems we currently have. The biggest objection against the sysfs-tunable approach is that most people won't have a clue meaning that the tunable is useless for them. But I wonder if something like: 1) turn on strictlimit by default 2) don't allow dirty cache of BDI to grow over 5s of measured writeback speed won't go a long way into solving our current problems without too much complication... Honza -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>