On Sat, Sep 23, 2023 at 3:43 AM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thu, 21 Sept 2023 at 11:51, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >
> > We have many, many inodes though, and 12 bytes per adds up!
>
> That was my thinking, but honestly, who knows what other alignment
> issues might eat up some - or all - of the theoretical 12 bytes.
>
> It might be, for example, that the inode is already some aligned size,
> and that the allocation alignment means that the size wouldn't
> *really* shrink at all.
>
> So I just want to make clear that I think the 12 bytes isn't
> necessarily there. Maybe you'd get it, maybe it would be hidden by
> other things.
>
> My biggest impetus was really that whole abuse of a type that I
> already disliked for other reasons.
>
> > I'm on board with the idea, but...that's likely to be as big a patch
> > series as the ctime overhaul was. In fact, it'll touch a lot of the
> > same code. I can take a stab at that in the near future though.
>
> Yea, it's likely to be fairly big and invasive. That was one of the
> reasons for my suggested "inode_time()" macro hack: using macro
> argument concatenation is really a hack to "gather" the pieces based
> on name, and while it's odd and not a very typical kernel model, I
> think doing it that way might allow the conversion to be slightly
> less painful.
>
> You'd obviously have to have the same kind of thing for assignment.
>
> Without that kind of name-based hack, you'd have to create all these
> random helper functions that just do the same thing over and over for
> the different times, which seems really annoying.
>
> > Since we're on the subject...another thing that bothers me with all
> > of the timestamp handling is that we don't currently try to mitigate
> > "torn reads" across the two different words. It seems like you could
> > fetch a tv_sec value and then get a tv_nsec value that represents an
> > entirely different timestamp if there are stores between them.
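For readers following along: the token-pasting idea Linus describes
above could look roughly like the sketch below. This is a plain
userspace mock-up with hypothetical field names (i_mtime_sec etc., not
the actual VFS layout), just to show how `##` concatenation gathers the
split sec/nsec fields by name for both read and assignment:

```c
#include <assert.h>
#include <stdint.h>

struct timespec64 { int64_t tv_sec; long tv_nsec; };

/* Hypothetical split-field layout, for illustration only */
struct inode {
	int64_t  i_atime_sec, i_mtime_sec, i_ctime_sec;
	uint32_t i_atime_nsec, i_mtime_nsec, i_ctime_nsec;
};

/* Read accessor: gathers i_<which>_sec/i_<which>_nsec by name */
#define inode_time(inode, which) \
	((struct timespec64){ (inode)->i_##which##_sec, \
			      (inode)->i_##which##_nsec })

/* The matching assignment helper you'd also need */
#define inode_time_set(inode, which, ts) do {		\
	(inode)->i_##which##_sec = (ts).tv_sec;		\
	(inode)->i_##which##_nsec = (ts).tv_nsec;	\
} while (0)
```

The point of doing it by name rather than by pointer is that one macro
covers atime/mtime/ctime without a per-timestamp helper function each.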
>
> Hmm. I think that's an issue that we have always had in theory, and
> have ignored because it's simply not problematic in practice, and
> fixing it is *hugely* painful.
>
> I suspect we'd have to use some kind of sequence lock for it (to make
> reads be cheap), and while it's _possible_ that having the separate
> accessor functions for reading/writing those times might help things
> out, I suspect the reading/writing happens for the different times
> (ie atime/mtime/ctime) together often enough that you might want to
> have the locking done at an outer level, and _not_ do it at the
> accessor level.
>
> So I suspect this is a completely separate issue (ie even an accessor
> doesn't make the "hugely painful" go away). And probably not worth
> worrying about *unless* somebody decides that they really really care
> about the race.
>
> That said, one thing that *could* help is if people decide that the
> right format for inode times is to just have one 64-bit word that has
> "sufficient resolution". That's what we did for "kernel time", ie
> "ktime_t" is a 64-bit nanosecond count, and by being just a single
> value, it avoids not just the horrible padding with 'struct
> timespec64', it is also dense _and_ can be accessed as one atomic
> value.

Just pointing out that xfs has already changed its on-disk timestamp
format to this model (a.k.a. bigtime), but the in-core inode still
uses timespec64, of course. The vfs could take inspiration from this
model.

> Sadly, that "sufficient resolution" couldn't be nanoseconds, because
> 64-bit nanoseconds isn't enough of a spread. It's fine for the kernel
> time, because 2**63 nanoseconds is 292 years, so it moved the "year
> 2038" problem to "year 2262".

Note that xfs_bigtime_to_unix(XFS_BIGTIME_TIME_MAX) is in year 2486,
not year 2262, because there was no need to use the 64-bit range to go
backwards to year 1678.
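As an aside, the sequence-lock scheme Linus mentions for the torn-read
problem boils down to the pattern below. This is only a userspace
sketch of the core idea - the writer makes the count odd around the
two-word update, and the reader retries until it sees a stable, even
count. The kernel's real seqcount API additionally deals with memory
barriers, preemption and lockdep, so don't take this as the actual
implementation:

```c
#include <stdatomic.h>
#include <stdint.h>

struct ts64 { int64_t tv_sec; long tv_nsec; };

static _Atomic unsigned int ts_seq;	/* even: stable, odd: write in progress */
static struct ts64 ts_val;

static void ts_write(int64_t sec, long nsec)
{
	atomic_fetch_add_explicit(&ts_seq, 1, memory_order_acq_rel);
	ts_val.tv_sec = sec;
	ts_val.tv_nsec = nsec;
	atomic_fetch_add_explicit(&ts_seq, 1, memory_order_acq_rel);
}

static struct ts64 ts_read(void)
{
	struct ts64 t;
	unsigned int seq;

	do {
		/* wait out an in-progress write, then snapshot */
		while ((seq = atomic_load_explicit(&ts_seq,
						memory_order_acquire)) & 1)
			;
		t = ts_val;
		/* retry if a writer slipped in during the copy */
	} while (atomic_load_explicit(&ts_seq, memory_order_acquire) != seq);
	return t;
}
```

The cost asymmetry is the point: reads stay lock-free and cheap, which
is why Linus suggests this shape rather than a plain lock - but as he
notes, you'd likely want one sequence count around all three times at
an outer level, not one per accessor.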
>
> And that's ok when we're talking about times that are kernel running
> times and we have a couple of centuries to say "ok, we'll need to
> make it be a bigger type", but when you save the values to disk,
> things are different. I suspect filesystem people are *not* willing
> to deal with a "year 2262" issue.
>

Apparently, they are willing to handle the "year 2486" issue ;)

> But if we were to say that "a tenth of a microsecond resolution is
> sufficient for inode timestamps", then suddenly 64 bits is
> *enormous*. So we could do a
>
>         // tenths of a microsecond since Jan 1, 1970
>         typedef s64 fstime_t;
>
> and have a nice dense timestamp format with reasonable - but not
> nanosecond - accuracy. Now that 292 year range has become 29,247
> years, and filesystem people *might* find the "year-31k" problem
> acceptable.
>
> I happen to think that "100ns timestamp resolution on files is
> sufficient" is a very reasonable statement, but I suspect that we'll
> still find lots of people who say "that's completely unacceptable"
> both to that resolution, and to the 31k-year problem.
>

I am guessing that you are aware of the Windows/SMB FILETIME standard,
which is 64-bit in 100ns units (since 1601), so the 31k-year "problem"
is already very widespread.

But the resolution change runs counter to the purpose of multigrain
timestamps - if two syscalls update the same inode (or two different
inodes) within one 100ns tick, apparently there are workloads that
care to know about it, and the fs needs to store this information
persistently.

Thanks,
Amir.