On Tue, May 29, 2012 at 8:57 AM, Fengguang Wu <fengguang.wu@xxxxxxxxx> wrote:
>
> Actually O_SYNC is pretty close to the below code for the purpose of
> limiting the dirty and writeback pages, except that it's not on by
> default, hence means nothing for normal users.

Absolutely not.

O_SYNC syncs the *current* write, syncs your metadata, and just generally
makes your writer synchronous. It's just a f*cking moronic idea. Nobody
sane ever uses it, since you are much better off just using fsync() if you
want that kind of behavior. It's one of those "stupid legacy flags" that
have no sane use.

The whole point is that doing that is never the right thing to do. You
want to sync *past* writes, and you never ever want to wait on them unless
you have just sent more (newer) writes to the disk that you are *not*
waiting on - so that you always have more IO pending.

O_SYNC is the absolute antithesis of that kind of "multiple levels of
overlapping IO", because it requires that the IO is _done_ by the time you
start more, which defeats the whole point.

> It seems to me all about optimizing the 1-dd case for desktop users,
> and the most beautiful thing about per-file write behind is, it keeps
> both the number of dirty and writeback pages low in the system when
> there are only one or two sequential dirtier tasks. Which is good for
> responsiveness.

Yes, but I don't think it's about a single-dd case - it's about trying to
handle one common case (streaming writes) efficiently and naturally. Try
to get those out of the system, so that you can then worry about the
*other* cases knowing that they don't have that kind of big streaming
behavior.

For example, right now our main top-level writeback logic is *not* about
streaming writes (just dirty counts), but then we try to "find" the
locality by making the lower-level writeback do the whole "write back by
chunking inodes" thing, without really having any higher-level
information.
I just suspect that we'd be better off teaching the upper levels about the
streaming. I know for a fact that when I did it by hand, system
responsiveness was *much* better, and IO throughput didn't go down at all.

> Note that the above user space code won't work well when there are 10+
> dirtier tasks. It effectively creates 10+ IO submitters on different
> regions of the disk and thus create lots of seeks.

Not really much more than our current writeback code does. It *schedules*
data for writing, but doesn't wait for it until much later. You seem to
think it was synchronous. It's not. Look at the second sync_file_range()
call - the important part is the "index-1". The fact that you confused
this with O_SYNC suggests the same misunderstanding: it has absolutely
*nothing* to do with O_SYNC.

The other important part is that the chunk size is fairly large. We do
read-ahead in 64kB kind of chunks; to make sense, the write-behind
chunking needs to be in the "multiple megabytes" range. 8MB is probably
the minimum size at which it makes sense.

Write-behind would be for things like people writing disk images and
video files, not for random IO in smaller chunks.

                   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html