On Tue, Sep 24, 2019 at 12:39 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> Stupid question: how is this any different to simply winding down
> our dirty writeback and throttling thresholds like so:
>
> # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes

Our dirty_background stuff is very questionable, but it exists (and has
those insane defaults) for various legacy reasons.

But it probably _shouldn't_ exist any more (except perhaps as a
last-ditch hard limit), and I don't think it really ends up being the
primary throttling mechanism any more in many cases.

It used to make sense to make it a "percentage of memory" back when we
were talking about old machines with 8MB of RAM, and having an
appreciable percentage of memory dirty was "normal". And we've kept
that model and not touched it, because some benchmarks really want
enormous amounts of dirty data (particularly various dirty shared
mappings).

But our default really is fairly crazy and questionable. 10% of memory
being dirty may be ok when you have a small amount of memory, but it's
rather less sane if you have gigs and gigs of RAM (on a 256GB machine
that's on the order of 25GB of dirty data before background writeback
even starts).

Of course, SSDs made it work slightly better again, but our
"dirty_background" stuff really is legacy and not very good.

The whole dirty limit is particularly questionable when seen as a
percentage of memory (which is our default), but it's bad even when
seen as total bytes. If you have a slow filesystem (say, FAT on a USB
stick), the limit should be very different from what it should be for
a fast one (e.g. XFS on a RAID of proper SSDs). So the limit really
needs to be per-bdi, not some global ratio or byte count (see the PS
below for the closest thing we have today).

As a result we've grown various _other_ heuristics over time, and the
simplistic dirty_background stuff is only a very small part of the
picture these days. To the point of almost being irrelevant in many
situations, I suspect.

> to start background writeback when there's 100MB of dirty pages in
> memory, and then:
>
> # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes

The thing is, that also accounts for dirty shared mmap pages. And it
really will kill some benchmarks that people take very, very seriously.

And 200MB is peanuts when you're doing a benchmark on some studly
machine that does a million iops; to it, 200MB of dirty data is
nothing. Yet it's probably much too big when you're on a workstation
that still has rotational media.

And the whole memcg code obviously makes this even more complicated.

Anyway, the end result of all this is that we have
balance_dirty_pages(), which is pretty darn complex, and I suspect
very few people understand everything that goes on in that function.

So I think the point of any write-behind logic would be to avoid
triggering the global limits as much as humanly possible - not just to
get the simple cases to write things out more quickly, but to remove
the complex global limit questions from (one) common and fairly simple
case.

Now, whether write-behind really _does_ help that, or whether it's
just yet another tweak and complication, I can't actually say. But I
don't think 'dirty_background_bytes' is really an argument against
write-behind; it's just one knob on the very complex dirty handling we
have.

               Linus
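
PS: to be fair, there is *some* per-bdi tuning already: each backing
device gets min_ratio/max_ratio knobs under /sys/class/bdi/ that carve
up the global dirty threshold. Roughly (the 8:16 device number is just
an example - use whatever major:minor lsblk reports for the slow
device):

  # echo 1 > /sys/class/bdi/8:16/max_ratio

which caps that device at 1% of the global dirty limit. But it's still
a fraction of the same global limit, not an absolute per-device
number, so it doesn't solve the "FAT on a USB stick vs XFS on fast
SSDs" problem by itself.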
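
PPS: if anybody actually tries Dave's numbers above, remember that the
*_bytes and *_ratio sysctls are mutually exclusive - writing one
zeroes the other - so double-check which mode you ended up in with
something like

  # grep . /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_bytes \
        /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_background_bytes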