On Thu, 2022-10-20 at 12:39 +0200, Jan Kara wrote:
> On Wed 19-10-22 14:47:48, Jeff Layton wrote:
> > On Wed, 2022-10-19 at 19:23 +0200, Jan Kara wrote:
> > > On Tue 18-10-22 13:04:34, Jeff Layton wrote:
> > > > On Tue, 2022-10-18 at 17:17 +0200, Jan Kara wrote:
> > > > > On Tue 18-10-22 10:21:08, Jeff Layton wrote:
> > > > > > On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote:
> > > > > > > On Tue 18-10-22 06:35:14, Jeff Layton wrote:
> > > > > > > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote:
> > > > > > > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote:
> > > > > > > > > > Trond is of the opinion that monotonicity is a hard requirement, and that we should not allow filesystems that can't provide that quality to report STATX_VERSION at all. His rationale is that one of the main uses for this is for backup applications, and for those a counter that could go backward is worse than useless.

> > > > > > > > > From the perspective of a backup program doing incremental backups, an inode with a change counter that has a different value to the current backup inventory means the file contains different information than what the current backup inventory holds. Again, snapshots, rollbacks, etc.

> > > > > > > > > Therefore, regardless of whether the change counter has gone forwards or backwards, the backup program needs to back up this current version of the file in this backup because it is different to the inventory copy. Hence if the backup program fails to back it up, it will not be creating an exact backup of the user's data at the point in time the backup is run...

> > > > > > > > > Hence I don't see that MONOTONIC is a requirement for backup programs - they really do have to be able to handle filesystems that have modifications that move backwards in time as well as forwards...

> > > > > > > > Rolling backward is not a problem in and of itself. The big issue is that after a crash, we can end up with a change attr seen before the crash that is now associated with a completely different inode state.

> > > > > > > > The scenario is something like:

> > > > > > > > - Change attr for an empty file starts at 1
> > > > > > > > - Write "A" to file, change attr goes to 2
> > > > > > > > - Read and statx happens (client sees "A" with change attr 2)
> > > > > > > > - Crash (before last change is logged to disk)
> > > > > > > > - Machine reboots, inode is empty, change attr back to 1
> > > > > > > > - Write "B" to file, change attr goes to 2
> > > > > > > > - Client stat's file, sees change attr 2 and assumes its cache is correct when it isn't (should be "B" not "A" now).

> > > > > > > > The real danger comes not from the thing going backward, but the fact that it can march forward again after going backward, and then the client can see two different inode states associated with the same change attr value. Jumping all the change attributes forward by a significant amount after a crash should avoid this issue.

> > > > > > > As Dave pointed out, the problem with the change attr having the same value for a different inode state (after going backwards) holds not only for crashes but also for restores from backups, fs snapshots, device snapshots etc. So relying on the change attr only looks a bit fragile. It works for the common case but the edge cases are awkward and there's no easy way to detect you are in the edge case.

> > > > > > This is true. In fact in the snapshot case you can't even rely on doing anything at reboot, since you won't necessarily need to reboot to make it roll backward.

> > > > > > Whether that obviates the use of this value altogether, I'm not sure.

> > > > > > > So I think any implementation caring about data integrity would have to include something like ctime into the picture anyway. Or we could just completely give up any idea of monotonicity and on each mount select a random prime P < 2^64, and instead of incrementing by 1 when advancing the change attribute, we'd advance it by P. That makes collisions after restore / crash fairly unlikely.

> > > > > > Part of the goal (at least for NFS) is to avoid unnecessary cache invalidations.

> > > > > > If we just increment it by a particular offset on every reboot, then every time the server reboots, the clients will invalidate all of their cached inodes, and proceed to hammer the server with READ calls just as it's having to populate its own caches from disk.

> > > > > Note that I didn't propose to increment by an offset on every reboot or mount. I have proposed that inode_maybe_inc_iversion() would not increment i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P << I_VERSION_QUERIED_SHIFT), where P is a suitable number randomly selected on filesystem mount.

> > > > > This will not cause cache invalidation after a clean unmount + remount. It will cause cache invalidation after a crash, snapshot rollback etc., only for inodes where i_version changed. If P is suitably selected (e.g. as being a prime), then the chances of collisions (even after a snapshot rollback) are very low (on the order of 2^(-50) if my back-of-the-envelope calculations are right).

> > > > > So this should nicely deal with all the problems we've spotted so far. But I may be missing something...

> > > > Got it! That makes a lot more sense. Thinking about this some more...

> > > > What sort of range for P would be suitable?

> > > > Every increment would need to be by (shifted) P, so we can't choose too large a number. Queries are pretty rare vs. writes though, so that mitigates the issue somewhat.

> > > Well, I agree that for large P the counter would wrap earlier. But is that a problem? Note that if P is a prime (indivisible by 2 is actually enough), then the counter would still only return to an already-used value after 2^63 steps. Thus if we give up monotonicity and just treat the counter as an opaque cookie, we do not have to care about wrapping.
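
Just to make sure I'm picturing the same thing you are: the increment side would end up looking something like the sketch below. This is rough and untested, and s_iversion_delta is just a placeholder name for wherever the per-mount prime ends up being stored:

static inline bool inode_maybe_inc_iversion(struct inode *inode, bool force)
{
        u64 cur, new;
        /* per-mount prime selected at mount time (placeholder field name) */
        u64 delta = inode->i_sb->s_iversion_delta;

        /* see the comment about this barrier in include/linux/iversion.h */
        smp_mb();
        cur = inode_peek_iversion_raw(inode);
        do {
                /* if the QUERIED flag is clear then we needn't do anything */
                if (!force && !(cur & I_VERSION_QUERIED))
                        return false;

                /* advance by the shifted per-mount prime rather than the fixed increment */
                new = (cur & ~I_VERSION_QUERIED) +
                      (delta << I_VERSION_QUERIED_SHIFT);
        } while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
        return true;
}

IOW, only the size of the step changes; the QUERIED flag handling stays as it is today.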

> > > Sure, given a different P is selected for each mount the wrapping argument does not hold 100%, but here comes the advantage of primes: if you have two different primes P and Q, then a collision means that k*P mod 2^63 = l*Q mod 2^63, and that holds for exactly one pair k,l in the 1..2^63 range. So the chances of an early collision, even after selecting a different prime on each mount, are *very* low.

> > I think we'll have to start avoiding 1 as a value for P if we do this, but the rest makes sense. I like this idea, Jan!

> Yes, 1 is kind of special so we'd better avoid it in this scheme. Especially if we're going to select only smaller primes.

> > > So I think we should select from a relatively large set of primes so that the chance of randomly selecting the same prime (and thus reissuing the same change attr for a different inode state sometime later) is small.

> > Monotonicity allows you to discard "old" attr updates. For instance, sometimes an NFS GETATTR response may be delayed for various reasons. If the client sees a change attr that is provably older than one it has already seen, it can discard the update. So, there is value in servers advertising that property, and NFSv4.2 has a way to do that.

> > The Linux NFS client (at least) uses the same trick we do with jiffies to handle wrapping for MONOTONIC values. We should be able to advertise MONOTONIC as long as the client isn't comparing values that are more than ~2^62 apart.

> > Once we start talking about applications storing these values for incremental backups, then the time between checks could be very long.

> > So, I think we don't want _too_ large a value for P. The big question is how many individual change attr increments do we need to account for?

> > We have 64 bits total (it's an atomic64_t). We consume the lowest bit for the QUERIED flag. That leaves us 63 bits of counter (currently). When we increment by a larger value, we're effectively decreasing the size of the counter.

> Yes, the larger the value of P we take, the sooner it will wrap, which defeats comparisons attempting to establish any ordering of change cookie values.

> > Let's assume a worst case of one increment per microsecond, interleaved with queries (so that they have to be real increments). 2^48 microseconds is close to 9 years.

> > That leaves 15 bits for P, which is primes from 3..32749. Is that a large enough pool of prime numbers?

> Well, there are ~3000 primes in this range, so that gives you a 1/3000 chance that after a crash, backup restore, snapshot rollback etc. you will pick the same prime, which results in collisions of change cookies and thus the possibility of data corruption. Is that a low enough chance? The events I mention above should be relatively rare, but given the number of machines running this code I would think a collision is bound to happen, and the consequences could be ... unpleasant. That's why I would prefer to pick primes up to at least, say, 1M (there are ~78k of those). But that makes wrapping more frequent (~100 days with a 1us update period). Probably still usable for NFS but not really for backup purposes. So I'm not sure we should be advertising that the values have any ordering.

Ok. I'll aim for using values between 3 and 1M and see how that looks. We should be able to tune this to some degree as well.
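
For picking P itself, I'm planning to lean on the existing prime helpers. Roughly like the untested sketch below (assumes CONFIG_PRIME_NUMBERS for next_prime_number(); the function name is just a placeholder):

#include <linux/prime_numbers.h>
#include <linux/random.h>

/*
 * Pick a random prime in (roughly) the 3..1M range to use as the
 * per-mount i_version increment. next_prime_number() returns the
 * smallest prime larger than its argument, so seeding it with a
 * random value below 1M yields a random-ish prime in that range.
 */
static u64 iversion_select_delta(void)
{
        unsigned long seed = get_random_u32() % 1000000;

        /* never hand back 1 or 2; the smallest prime we return is 3 */
        if (seed < 2)
                seed = 2;
        return next_prime_number(seed);
}

The result would get stashed in the superblock at mount time so that the iversion helpers can get at it.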
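
And for completeness, the jiffies-style wraparound handling I mentioned above is just the usual signed-difference trick, along these lines (illustrative only, not the actual client code):

/*
 * Same idea as time_after(): treat the difference as signed, so the
 * comparison stays correct across a wrap as long as the two cookies
 * are well under 2^63 apart (the ~2^62 figure above leaves some margin).
 */
static inline bool change_attr_after(u64 a, u64 b)
{
        return (s64)(a - b) > 0;
}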

> If the last used value would be persisted (e.g. in the filesystem's superblock), we could easily make sure the next selected P is different, so in that case we could get away with a smaller set of primes, but it would require filesystem on-disk format changes, which has its own drawbacks. But that would be at least some path forward for providing change cookies that can be ordered on larger timescales.

Persisting just the last one might not be sufficient. If the machine crashes several times, then you could still end up re-using the same P value. i_version is only incremented on changes, so if you're unlucky and it's only incremented again when the duplicate value of P comes up, you're back to the same problem. We might want to keep a record of the last several P values?

OTOH, there is always going to be _some_ way to defeat this. At some point we just have to decide that a scenario is unlikely enough that we can ignore it.

> > It looks like the kernel already has some infrastructure for handling primes in lib/math/prime_numbers.c. We could just select a global P value to use on every reboot, or just have filesystems set their own (maybe in a new field in the superblock?)

> IMO P needs to be selected on each mount to reliably solve the "restore from backup" and "snapshot rollback" scenarios. I agree it can be a new field in the VFS part of the superblock so that it is accessible by the iversion handling code.

Sounds good.

FWIW, I think I'm going to have to approach this in at least three patchsets:

1) clean up the presentation of the value, and plumb it through struct kstat (aiming for v6.2 for this part).

2) start incrementing the value after a write, in addition to (or instead of) before a write. (I have patches for tmpfs, ext4 and btrfs -- maybe v6.3?)

3) change the increment to use a prime number we select at mount time to ward off rollback issues. (still scoping out this part)

Thanks!
--
Jeff Layton <jlayton@xxxxxxxxxx>