On Wed, Nov 27, 2019 at 4:42 PM Steven Whitehouse <swhiteho@xxxxxxxxxx> wrote:
> Hi,
>
> On 25/11/2019 17:05, Linus Torvalds wrote:
> > On Mon, Nov 25, 2019 at 2:53 AM Steven Whitehouse <swhiteho@xxxxxxxxxx> wrote:
> >> Linus, is that roughly what you were thinking of?
> >
> > So the concept looks ok, but I don't really like the new flags as
> > they seem to be gfs2-specific rather than generic.
> >
> > That said, I don't _hate_ them either, since they aren't in any
> > critical code sequence, and it's not like they do anything really
> > odd.
> >
> > I still think the whole gfs2 approach is broken. You're magically ok
> > with using stale data from the cache just because it's cached, even
> > if another client might have truncated the file or something.
>
> If another node tries to truncate the file, that will require an
> exclusive glock, and in turn that means that all the other nodes will
> have to drop their glock(s), shared or exclusive. That process
> invalidates the page cache on those nodes, such that any further
> requests on those nodes will find the cache empty and have to call
> into the filesystem.
>
> If a page is truncated on another node, then when the local node gives
> up its glock, after any copying (e.g. for read) has completed, the
> truncate will take place. The local node will then have to reread any
> data relating to new pages, or return an error in case the next page
> to be read has vanished due to the truncate. It is a pretty small
> window, and the advantage is that in cases where the page is in cache,
> we can directly use the cached page without having to call into the
> filesystem at all. So it is page atomic in that sense.
>
> The overall aim here is to avoid taking (potentially slow) cluster
> locks when at all possible, yet at the same time deliver close to
> local fs semantics whenever we can. You can think of GFS2's glock
> concept (at least as far as the inodes we are discussing here) as
> providing a combination of (page) cache control and cluster (dlm)
> locking.
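To make the cache-control side of that concrete: when another node asks
for the glock in a conflicting mode, the holder writes back and
invalidates its page cache before releasing the lock. Roughly like the
following (a simplified sketch only, not the actual glops code; the
function name is made up and all error handling is omitted):

/* Sketch: run when we must demote our inode glock so that another
 * node can take it (e.g. exclusively, for a truncate). */
static void demote_inode_glock(struct inode *inode)
{
	struct address_space *mapping = inode->i_mapping;

	/* Flush any dirty pages while we still hold the lock... */
	filemap_write_and_wait(mapping);

	/* ...then drop the cached pages, so that "page in cache and
	 * uptodate" continues to imply "glock held". */
	invalidate_inode_pages2(mapping);
}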
> > So you're ok with saying "the file used to be X bytes in size, so
> > we'll just give you this data because we trust that the X is
> > correct".
> >
> > But you're not ok to say "oh, the file used to be X bytes in size,
> > but we don't want to give you a short read because it might not be
> > correct any more".
> >
> > See the disconnect? You trust the size in one situation, but not in
> > another one.
>
> Well, we are not trusting the size at all: the original algorithm
> worked entirely off "is this page in cache and uptodate?", precisely
> because we know the size in the inode might be out of date if we are
> not currently holding a glock in either shared or exclusive mode. We
> also know that if there is a page in cache and uptodate, then we must
> be holding the glock, too.
>
> > I also don't really see that you *need* the new flag at all. Since
> > you're going to do a speculative read and then a real read anyway,
> > and since the only thing that you seem to care about is the file
> > size (because the *data* you will trust if it is cached), then why
> > don't you just use the *existing* generic read, and *IFF* you get a
> > truncated return value, then you go and double-check that the file
> > hasn't changed in size?
> >
> > See what I'm saying? I think gfs2 is being very inconsistent in when
> > it trusts the file size, and I don't see that you even need the new
> > behavior that patch gives, because you might as well just use the
> > existing code (just move the i_size check earlier, and then teach
> > gfs2 to double-check the "I didn't get as much as I expected" case).

We can identify short reads, but we won't get information about
readahead back from generic_file_read_iter or filemap_fault. We could
try to work around this with filesystem-specific flags for ->readpage
and ->readpages, but that would break down with multiple concurrent
readers, in addition to being a real mess. I'm currently out of better
ideas that avoid duplicating the generic code.

> > Linus
>
> I'll leave the finer details to Andreas here, since it is his patch,
> and hopefully we can figure out a good path forward. We are perhaps
> also a bit reluctant to go off and (nearly) duplicate code that is
> already in the core vfs library functions, since that often leads to
> things getting out of sync (our implementation of ->writepages is one
> case where that happened in the past) and to missing important bug
> fixes/features in some cases. Hopefully, though, we can iterate on
> this a bit and come up with something which will resolve all the
> issues.
>
> Steve.

Thanks,
Andreas
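PS: For illustration, the "double-check on a short read" approach would
look roughly like the sketch below. This is not actual gfs2 code (the
function is hypothetical and error handling is omitted), and note that
it does not solve the problem described above: the first, lockless
generic_file_read_iter call can still issue readahead and ->readpage
I/O without the glock held.

static ssize_t read_with_glock_fallback(struct kiocb *iocb,
					struct iov_iter *to)
{
	struct gfs2_inode *ip = GFS2_I(file_inode(iocb->ki_filp));
	struct gfs2_holder gh;
	ssize_t ret;

	/* Speculative read from the page cache, no cluster lock taken. */
	ret = generic_file_read_iter(iocb, to);
	if (ret < 0 || !iov_iter_count(to))
		return ret;

	/*
	 * Short read: the locally cached i_size may be stale. Take the
	 * inode glock in shared mode, which revalidates the inode, and
	 * continue reading from the updated iocb->ki_pos.
	 */
	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
	if (gfs2_glock_nq(&gh) == 0) {
		ssize_t ret2 = generic_file_read_iter(iocb, to);

		if (ret2 > 0)
			ret += ret2;
		gfs2_glock_dq(&gh);
	}
	gfs2_holder_uninit(&gh);
	return ret;
}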