On Tue, May 29, 2012 at 8:57 AM, Fengguang Wu <fengguang.wu@xxxxxxxxx> wrote:
>
> Actually O_SYNC is pretty close to the below code for the purpose of
> limiting the dirty and writeback pages, except that it's not on by
> default, hence means nothing for normal users.

Absolutely not.

O_SYNC syncs the *current* write, syncs your metadata, and just generally
makes your writer synchronous. It's just a f*cking moronic idea. Nobody
sane ever uses it, since you are much better off just using fsync() if you
want that kind of behavior. It's one of those "stupid legacy flags" that
have no sane use.

The whole point is that doing that is never the right thing to do. You
want to sync *past* writes, and you never ever want to wait on them unless
you have just sent more (newer) writes to the disk that you are *not*
waiting on - so that you always have more IO pending.

O_SYNC is the absolute antithesis of that kind of "multiple levels of
overlapping IO", because it requires that the IO is _done_ by the time you
start more, which defeats the whole point.

> It seems to me all about optimizing the 1-dd case for desktop users,
> and the most beautiful thing about per-file write behind is, it keeps
> both the number of dirty and writeback pages low in the system when
> there are only one or two sequential dirtier tasks. Which is good for
> responsiveness.

Yes, but I don't think it's about a single-dd case - it's about trying to
handle one common case (streaming writes) efficiently and naturally. Try
to get those out of the system, so that you can then worry about the
*other* cases knowing that they don't have that kind of big streaming
behavior.

For example, right now our main top-level writeback logic is *not* about
streaming writes (just dirty counts), but then we try to "find" the
locality by making the lower-level writeback do the whole "write back by
chunking inodes" thing, without really having any higher-level
information.
I just suspect that we'd be better off teaching the upper levels about the
streaming. I know for a fact that when I did it by hand, system
responsiveness was *much* better, and IO throughput didn't go down at all.

> Note that the above user space code won't work well when there are 10+
> dirtier tasks. It effectively creates 10+ IO submitters on different
> regions of the disk and thus create lots of seeks.

Not really much more than our current writeback code does. It *schedules*
data for writing, but doesn't wait for it until much later. You seem to
think it was synchronous. It's not. Look at the second sync_file_range()
call - the important part is the "index-1". The fact that you confused
this with O_SYNC suggests the same misunderstanding: it has absolutely
*nothing* to do with O_SYNC.

The other important part is that the chunk size is fairly large. We do
read-ahead in 64kB kind of chunks; to make sense, the write-behind
chunking needs to be in the "multiple megabytes" range. 8MB is probably
the minimum size at which it makes sense.

Write-behind would be for things like people writing disk images and
video files, not for random IO in smaller chunks.

                   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html