On Tue, Sep 24, 2019 at 12:39 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> Stupid question: how is this any different to simply winding down
> our dirty writeback and throttling thresholds like so:
>
> # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes

Our dirty_background stuff is very questionable, but it exists (and has
those insane defaults) for various legacy reasons.

But it probably _shouldn't_ exist any more (except perhaps as a
last-ditch hard limit), and I don't think it really ends up being the
primary throttling mechanism any more in many cases.

It used to make sense to make it a "percentage of memory" back when we
were talking about old machines with 8MB of RAM, and having an
appreciable percentage of memory dirty was "normal". And we've kept
that model and not touched it, because some benchmarks really want
enormous amounts of dirty data (particularly various dirty shared
mappings).

But our default really is fairly crazy and questionable. 10% of memory
being dirty may be ok when you have a small amount of memory, but it's
rather less sane if you have gigs and gigs of RAM (on a 256GB machine
that's on the order of 25GB of dirty data before background writeback
even starts).

Of course, SSDs made it work slightly better again, but our
"dirty_background" stuff really is legacy and not very good.

The whole dirty limit is particularly questionable when seen as a
percentage of memory (which is our default), but it's bad even when
seen as total bytes. If you have a slow filesystem (say, FAT on a USB
stick), the limit should be very different from what it should be for
a fast one (e.g. XFS on a RAID of proper SSDs). So the limit really
needs to be per-bdi, not some global ratio or byte count (see the PS
below for the closest thing we have today).

As a result we've grown various _other_ heuristics over time, and the
simplistic dirty_background stuff is only a very small part of the
picture these days. To the point of almost being irrelevant in many
situations, I suspect.

> to start background writeback when there's 100MB of dirty pages in
> memory, and then:
>
> # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes

The thing is, that also accounts for dirty shared mmap pages. And it
really will kill some benchmarks that people take very, very seriously.

And 200MB is peanuts when you're doing a benchmark on some studly
machine that does a million iops; to it, 200MB of dirty data is
nothing. Yet it's probably much too big when you're on a workstation
that still has rotational media.

And the whole memcg code obviously makes this even more complicated.

Anyway, the end result of all this is that we have
balance_dirty_pages(), which is pretty darn complex, and I suspect
very few people understand everything that goes on in that function.

So I think the point of any write-behind logic would be to avoid
triggering the global limits as much as humanly possible - not just to
get the simple cases to write things out more quickly, but to remove
the complex global limit questions from (one) common and fairly simple
case.

Now, whether write-behind really _does_ help that, or whether it's
just yet another tweak and complication, I can't actually say. But I
don't think 'dirty_background_bytes' is really an argument against
write-behind; it's just one knob on the very complex dirty handling we
have.

               Linus
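
PS: to be fair, there is *some* per-bdi tuning already: each backing
device gets min_ratio/max_ratio knobs under /sys/class/bdi/ that carve
up the global dirty threshold. Roughly (the 8:16 device number is just
an example - use whatever major:minor lsblk reports for the slow
device):

  # echo 1 > /sys/class/bdi/8:16/max_ratio

which caps that device at 1% of the global dirty limit. But it's still
a fraction of the same global limit, not an absolute per-device
number, so it doesn't solve the "FAT on a USB stick vs XFS on fast
SSDs" problem by itself.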
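
PPS: if anybody actually tries Dave's numbers above, remember that the
*_bytes and *_ratio sysctls are mutually exclusive - writing one
zeroes the other - so double-check which mode you ended up in with
something like

  # grep . /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_bytes \
        /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_background_bytes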