On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > Hum, so are we fine if i_version just changes (increases) for all inodes
> > after a server crash? If I understand its use right, it would mean
> > invalidation of all client's caches but that is not such a big deal given
> > how frequent server crashes should be, right?

Even if it's rare, it may be really painful when all your clients are
forced to throw out and repopulate their caches after a crash. But,
yes, maybe we can live with it.

> > Because if above is acceptable we could make reported i_version to be a sum
> > of "superblock crash counter" and "inode i_version". We increment
> > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> > That way after a crash we are guaranteed each inode will report new
> > i_version (the sum would probably have to look like "superblock crash
> > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > i_version numbers we gave away but did not write to disk but still...).
> > Thoughts?

How hard is this for filesystems to support? Do they need an on-disk
format change to keep track of the crash counter? Maybe not, maybe the
high bits of the i_version counters are all they need.

> That does sound like a good idea. This is a 64 bit value, so we should
> be able to carve out some upper bits for a crash counter without risking
> wrapping.
>
> The other constraint here is that we'd like any later version of the
> counter to be larger than any earlier value that was handed out. I think
> this idea would still satisfy that.

I guess we just want to have some back-of-the-envelope estimates of
maximum number of i_version increments possible between crashes and
maximum number of crashes possible over lifetime of a filesystem, to
decide how to split up the bits.

I wonder if we could get away with using the new crash counter only for
*new* values of the i_version?
After a crash, use the on disk i_version as is, and put off using the
new crash counter until the next time the file's modified.

That would still eliminate the risk of accidental reuse of an old
i_version value. It still leaves some cases where the client could fail
to notice an update indefinitely. All these cases I think have to
assume that a writer made some changes that it failed to ever sync, so
as long as we care only about close-to-open semantics perhaps those
cases don't matter.

I wonder if repeated crashes can lead to any odd corner cases.

--b.
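
For concreteness, the two encodings discussed above could be sketched
roughly as follows. This is only an illustration: the function names,
the macro names, and the 16/48 bit split are assumptions made up for
this example, not existing kernel code or an agreed-upon layout.

```c
#include <stdint.h>

/*
 * Hypothetical split of the 64-bit value: the top CRASH_BITS hold the
 * superblock crash counter, the rest hold the per-inode counter.
 * The 16/48 split is an assumption chosen only for illustration.
 */
#define CRASH_BITS     16
#define IVERSION_BITS  (64 - CRASH_BITS)
#define IVERSION_MASK  ((UINT64_C(1) << IVERSION_BITS) - 1)

/* Variant 1: carve out the upper bits for the crash counter. */
static inline uint64_t reported_iversion_bits(uint64_t crash_count,
					      uint64_t i_version)
{
	return (crash_count << IVERSION_BITS) | (i_version & IVERSION_MASK);
}

/*
 * Variant 2: the sum form, "crash counter" * 65536 + "inode i_version".
 * The stride of 65536 leaves a gap for values that were handed out
 * before the crash but never written to disk.
 */
#define CRASH_STRIDE  UINT64_C(65536)

static inline uint64_t reported_iversion_sum(uint64_t crash_count,
					     uint64_t i_version)
{
	return crash_count * CRASH_STRIDE + i_version;
}
```

With the assumed 16/48 split, variant 1 allows 2^48 (about 2.8e14)
i_version increments between crashes and 65536 crashes over the
lifetime of the filesystem. Variant 2 stays ahead of previously
reported values only as long as fewer than 65536 unsynced increments
were handed out before the crash.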