On Thu, May 05, 2022 at 02:58:21PM +1000, Dave Chinner wrote:
> On Tue, May 03, 2022 at 07:39:58AM +0100, Matthew Wilcox (Oracle) wrote:
> > This is very much in development and basically untested, but Damien
> > started describing to me something that he wanted, and I told him he
> > was asking for the wrong thing, and I already had this patch series
> > in progress.  If someone wants to pick it up and make it mergeable,
> > that'd be grand.
>
> That's very non-descriptive.  Saying "someone wanted something, I said
> it's wrong, so here's a patch series about something else" doesn't tell
> me anything about the problem that Damien was trying to solve.

Sorry about that.  I was a bit jet-lagged when I wrote it.

> > The idea is that an O_SYNC write is always going to want to write, and
> > we know that at the time we're storing into the page cache.  So for an
> > otherwise clean folio, we can skip the part where we dirty the folio,
> > find the dirty folios and wait for their writeback.
>
> What exactly is this shortcut trying to optimise away?  A bit of CPU
> time?
>
> O_SYNC is already a write-through operation - we just call
> filemap_write_and_wait_range() once we've copied the data into the
> page cache and dirtied the page.  What does skipping the dirty page
> step gain us?

Two things: the original reason I was doing this, and Damien's reason.

My reason: a small write to a large folio will cause the entire folio
to be dirtied and written.  This is unnecessary with O_SYNC; since
we're about to force the write anyway, we may as well write only the
modified part of the folio and skip the whole dirtying step.

Damien's reason: it's racy.  Somebody else (... even vmscan) could
cause folios to be written out of order.  This matters for ZoneFS
because writing a file out of order is Not Allowed.  He was looking at
relaxing O_DIRECT, but I think what he really wants is a writethrough
page cache.
> > We can just mark the folio as writeback-in-progress and start the
> > I/O there and then (where we know exactly which blocks need to be
> > written, so possibly a smaller I/O than writing the entire page).
> > The existing "find dirty pages, start I/O and wait on them" code
> > will end up waiting on this pre-started I/O to complete, even though
> > it didn't start any of its own I/O.
> >
> > The important part is patch 9.  Everything before it is boring prep
> > work.  I'm in two minds about whether to keep the 'write_through'
> > bool, or remove it.  So feel free to read patches 9+10 squashed
> > together, or as if patch 10 doesn't exist.  Whichever feels better.
> >
> > The biggest problem with all this is that iomap doesn't have the
> > necessary information to cause extent allocation, so if you do an
> > O_SYNC write to an extent which is a HOLE or DELALLOC, we can't do
> > this optimisation.  Maybe that doesn't really matter for interesting
> > applications.  I suspect it doesn't matter for ZoneFS.
>
> This seems like a lot of complexity for only partial support.  It
> introduces races with page dirtying and cleaning, it likely has
> interesting issues with all the VM dirty/writeback accounting
> (because this series is using a completion path that expects the
> submission path has done its side of the accounting), and it only
> works if certain preconditions are met.

If we want to have better O_SYNC support, I think we can improve those
conditions.  For example, XFS could preallocate the blocks before
calling into iomap.  Since it's an O_SYNC write, everything is already
terrible.