On Mon, Jun 10, 2019 at 06:55:15PM -1000, Linus Torvalds wrote:
> On Mon, Jun 10, 2019 at 3:17 PM Kent Overstreet
> <kent.overstreet@xxxxxxxxx> wrote:
> > > Why does the regular page lock (at a finer granularity) not suffice?
> >
> > Because the lock needs to prevent pages from being _added_ to the page
> > cache -- to do it with a page granularity lock it'd have to be part of
> > the radix tree,
>
> No, I understand that part, but I still think we should be able to do
> the locking per-page rather than over the whole mapping.
>
> When doing dio, you need to iterate over old existing pages anyway in
> that range (otherwise the "no _new_ pages" part is kind of pointless
> when there are old pages there), so my gut feel is that you might as
> well at that point also "poison" the range you are doing dio on. With
> the xarray changes, we might be better at handling ranges. That was
> one of the arguments for the xarrays over the old radix tree model,
> after all.

We could do that -- if there are pages (or shadow entries) in the XArray,
replace them with "lock entries". I think we'd want faults / buffered I/O
to wait on those entries being removed. The DAX code is just about ready
to switch over to lock entries instead of having a special DAX lock bit.

The question is what to do when there are _no_ pages in the tree for a
range we're about to do DIO on. This should be the normal case -- as you
say, DIO users typically have their own schemes for caching in userspace,
and rather resent other users starting to cache their file in the kernel.
Adding lock entries in the page cache for every DIO starts to look pretty
painful in terms of allocating radix tree nodes. And it gets worse when
you have sub-page-size DIOs -- do we embed a count in the lock entry? Or
delay DIOs which hit the same page as an existing DIO?

And then I start to question the whole reasoning behind how we do mixed
DIO and buffered I/O: if there's a page in the page cache, why are we
writing it out and then doing a direct I/O, instead of doing a memcpy to
the page first and then writing the page back?

IOW, move to a model where (rough code sketches below):

 - If i_dio_count is non-zero, buffered I/O waits for i_dio_count to
   drop to zero before bringing pages in.

 - If i_pages is empty, DIOs increment i_dio_count, do the I/O and
   decrement i_dio_count.

 - If i_pages is not empty, DIO is implemented by doing buffered I/O
   and waiting for the pages to finish writeback.

(needs a slight tweak to ensure that new DIOs can't hold off a buffered
I/O indefinitely)
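
To make the lock entry part concrete, here's roughly what I have in
mind. This is a sketch only: xa_mk_value() and xa_cmpxchg() are the real
XArray calls, but the dio_lock_range() / dio_unlock_range() names, the
DIO_LOCK_ENTRY value and the give-up-and-unwind policy are all made up.
It also shows where the per-index node allocation pain comes from:

/*
 * Hypothetical "lock entry" helpers for a DIO range. xa_mk_value()
 * and xa_cmpxchg() are real XArray API; everything else is made up
 * for illustration.
 */
#define DIO_LOCK_ENTRY	xa_mk_value(0x1)

static int dio_lock_range(struct address_space *mapping,
			  pgoff_t first, pgoff_t last)
{
	pgoff_t index;

	for (index = first; index <= last; index++) {
		/*
		 * Claim the slot iff nothing is cached there. This is
		 * the painful part: potentially a node allocation per
		 * index, on every DIO, even when the range was empty.
		 */
		void *old = xa_cmpxchg(&mapping->i_pages, index, NULL,
				       DIO_LOCK_ENTRY, GFP_KERNEL);
		if (old) {
			/*
			 * A page, a shadow entry or another DIO's lock
			 * entry (this is where a sub-page DIO would want
			 * a count embedded in the entry). Unwind what we
			 * claimed; real code would wait or fall back
			 * rather than fail.
			 */
			while (index-- > first)
				xa_cmpxchg(&mapping->i_pages, index,
					   DIO_LOCK_ENTRY, NULL,
					   GFP_KERNEL);
			return -EBUSY;
		}
	}
	return 0;
}

static void dio_unlock_range(struct address_space *mapping,
			     pgoff_t first, pgoff_t last)
{
	pgoff_t index;

	/* Real code would also wake anybody waiting on these entries. */
	for (index = first; index <= last; index++)
		xa_cmpxchg(&mapping->i_pages, index, DIO_LOCK_ENTRY, NULL,
			   GFP_KERNEL);
}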
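
And the mixed buffered/DIO model above, as a write-side sketch.
inode_dio_begin(), inode_dio_end() and inode_dio_wait() are the existing
i_dio_count helpers; sketch_write_iter() and sketch_direct_io() are
stand-ins for a filesystem's actual paths, and this glosses over both
the nrpages check racing with a concurrent page cache add and the
starvation tweak mentioned above:

static ssize_t sketch_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct file *file = iocb->ki_filp;
	struct inode *inode = file_inode(file);
	struct address_space *mapping = inode->i_mapping;
	loff_t pos = iocb->ki_pos;
	ssize_t ret;

	if (iocb->ki_flags & IOCB_DIRECT) {
		if (mapping->nrpages == 0) {
			/*
			 * i_pages is empty: the common case. Hold off
			 * buffered I/O via i_dio_count for the duration.
			 */
			inode_dio_begin(inode);
			ret = sketch_direct_io(iocb, from); /* hypothetical */
			inode_dio_end(inode);
			return ret;
		}
		/*
		 * i_pages is not empty: do the "direct" write as a
		 * buffered write (memcpy into the cached pages) plus a
		 * wait for writeback, instead of write-out-then-reissue.
		 */
		ret = generic_perform_write(file, from, pos);
		if (ret > 0) {
			int err = filemap_write_and_wait_range(mapping, pos,
							       pos + ret - 1);
			if (err)
				ret = err;
		}
		return ret;
	}

	/*
	 * Buffered I/O: wait for i_dio_count to drop to zero before
	 * bringing pages in.
	 */
	inode_dio_wait(inode);
	return generic_perform_write(file, from, pos);
}

The nice part is that the third case replaces the write-out-then-reissue
dance with a memcpy plus writeback, and coherency with the page cache
falls out for free.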