On Wed, Nov 12, 2014 at 7:28 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Wed, Nov 12, 2014 at 09:26:16AM -0500, Trond Myklebust wrote:
>> On Wed, Nov 12, 2014 at 5:24 AM, Christoph Hellwig <hch@xxxxxx> wrote:
>> > On Wed, Nov 12, 2014 at 09:27:10AM +1100, Dave Chinner wrote:
>> >> To clarify what Christoph wrote, XFS updates i_version
>> >> once per transaction that modifies the inode. So if a VFS level
>> >> operation results in multiple transactions then each transaction
>> >> will bump the version.
>> >>
>> >> It was implemented that way because nobody could tell me what the
>> >> actual granularity requirement for change detection was. Hence what
>> >> I implemented was "be able to detect any persistent change that is
>> >> made" to cover all bases.
>> >
>> > Honestly the XFS implementation seems most sensible, and easiest to
>> > verify for me. I don't really understand the rationale behind the
>> > fairly convoluted NFS4_CHANGE_TYPE_IS_VERSION_COUNTER semantics, and
>> > I doubt you could actually implement them on any Unix-like semantics.
>> >
>> > Trond, given that the language in the standard is from you:
>> >
>> > 1) how do you expect to use NFS4_CHANGE_TYPE_IS_VERSION_COUNTER
>> > semantics in the client
>>
>> Basically, I'd like to use it the same way that AFS does. I want to be
>> able to issue an RPC call which does the equivalent of a single system
>> call (e.g. mkdir(), write(), link(), unlink(), etc) and be able to
>> predict what the effect should be on the change attribute (1 increment
>> on the parent directory for a successful mkdir(), 1 increment on the
>> file for a successful write(), ...)
>
> That's not the way the change version counter is implemented in the
> VFS or any filesystem. It's a low level change primitive, not
> something that is only updated on a syscall granularity.
>
> I just can't see how a change counter at the syscall level can be
> made to work reliably.
> NFS clients are now being told about server
> block maps, so any extent map modification done by the underlying
> filesystem needs to bump the change count so if the client is
> caching the block map it can be invalidated. And with functionality
> like delayed allocation, modifications the client needs to know about
> can happen at any time and so change count modification can not be
> limited only to syscall activity.

I didn't say it needs to be implemented in the VFS. Just that it needs
to be implemented in a way that makes sense if you are doing the
equivalent of a system call. Delayed allocations are a filesystem
implementation detail that does not change the application visible data
or metadata contents of the file; there should be no reason to have
them reflected in something like the change attribute.

As for pNFS blocks, I agree that the spec there is a little iffy, but
the intention was, I believe, that the whole LAYOUTGET->LAYOUTCOMMIT
should be considered to be a single filesystem transaction. However
the iffiness there is the main reason why I made a distinction between
pnfs vs. non-pnfs when describing the change attribute.

>> so that I can detect if someone
>> else has been modifying the file/directory/symlink while I wasn't
>> looking and hence know when I need to invalidate my cached
>> metadata+data for that object.
>
> The only way to use the change count sanely from the client is as a
> "check-and-execute" cookie on the server. If the change count sent
> by the client is unchanged at the server then the server can execute
> the operation. It can then return the new cookie to the client for
> the next operation. But we can't even do that sanely on Linux
> because the check-and-execute operation needs to be atomic and hence
> requires the filesystem to do it deep inside their transaction
> subsystems once they've taken the locks they need to ensure the
> change count is stable.
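
A minimal sketch of the "check-and-execute" cookie idea quoted above,
assuming a purely in-memory stand-in for the server. The Inode class,
check_and_write() and the single lock are hypothetical names used only
for illustration; they are not VFS, XFS or NFS interfaces, and a real
filesystem would have to do the comparison inside its own transaction
locking, which is exactly the difficulty being described.

```python
import threading


class Inode:
    """Hypothetical in-memory stand-in for a server-side inode."""

    def __init__(self):
        self.change = 0          # change count, used as the cookie
        self.data = b""
        self._lock = threading.Lock()

    def check_and_write(self, expected_change, payload):
        """Append payload only if the change count still matches.

        The comparison and the update happen under one lock, so the
        pair is atomic with respect to other callers of this method.
        Returns (applied, current_change).
        """
        with self._lock:
            if self.change != expected_change:
                # Someone else modified the inode; reject the operation.
                return False, self.change
            self.data += payload
            self.change += 1
            return True, self.change


inode = Inode()
ok, cookie = inode.check_and_write(0, b"a")      # cookie 0 matches
stale, current = inode.check_and_write(0, b"b")  # cookie 0 is now stale
```

Here the first call is applied and hands back the new cookie, while the
second is refused because the cookie it carries no longer matches.
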

Applications are required to interact with the filesystem through a
well-defined API. Application visible data and metadata changes can be
(and are mostly) well defined w.r.t. that API. Where be the dragons?

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx
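
The client-side use of the change attribute discussed in this thread
(predict one increment per successful operation, invalidate cached
data when the prediction fails) might look roughly like the sketch
below. It assumes version-counter semantics of exactly one increment
per operation; CachedObject and after_own_op() are hypothetical names,
not part of the Linux NFS client.

```python
class CachedObject:
    """Hypothetical client-side cache entry for a file or directory."""

    def __init__(self, change):
        self.change = change   # last change attribute seen from the server
        self.valid = True      # whether cached data and metadata are usable

    def after_own_op(self, server_change):
        """Record the change attribute returned after our own operation.

        Assumption: each successful operation bumps the counter by
        exactly one. Any other value means someone else modified the
        object in the meantime, so the cache is marked invalid.
        """
        expected = self.change + 1
        if server_change != expected:
            self.valid = False
        self.change = server_change
        return self.valid


entry = CachedObject(change=10)
first = entry.after_own_op(11)    # matches the prediction
second = entry.after_own_op(13)   # counter skipped 12: another writer
```

After the second call the entry stays invalid until the client
refetches the object; that refetch step is left out of the sketch.
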