On Thu, 21 Sept 2023 at 11:51, Jeff Layton <jlayton@xxxxxxxxxx> wrote: > > We have many, many inodes though, and 12 bytes per adds up! That was my thinking, but honestly, who knows what other alignment issues might eat up some - or all - of the theoreteical 12 bytes. It might be, for example, that the inode is already some aligned size, and that the allocation alignment means that the size wouldn't *really* shrink at all. So I just want to make clear that I think the 12 bytes isn't necessarily there. Maybe you'd get it, maybe it would be hidden by other things. My biggest impetus was really that whole abuse of a type that I already disliked for other reasons. > I'm on board with the idea, but...that's likely to be as big a patch > series as the ctime overhaul was. In fact, it'll touch a lot of the same > code. I can take a stab at that in the near future though. Yea, it's likely to be fairly big and invasive. That was one of the reasons for my suggested "inode_time()" macro hack: using the macro argument concatenation is really a hack to "gather" the pieces based on name, and while it's odd and not a very typical kernel model, I think doing it that way might allow the conversion to be slightly less painful. You'd obviously have to have the same kind of thing for assignment. Without that kind of name-based hack, you'd have to create all these random helper functions that just do the same thing over and over for the different times, which seems really annoying. > Since we're on the subject...another thing that bothers me with all of > the timestamp handling is that we don't currently try to mitigate "torn > reads" across the two different words. It seems like you could fetch a > tv_sec value and then get a tv_nsec value that represents an entirely > different timestamp if there are stores between them. Hmm. I think that's an issue that we have always had in theory, and have ignored because it's simply not problematic in practice, and fixing it is *hugely* painful. I suspect we'd have to use some kind of sequence lock for it (to make reads be cheap), and while it's _possible_ that having the separate accessor functions for reading/writing those times might help things out, I suspect the reading/writing happens for the different times (ie atime/mtime/ctime) together often enough that you might want to have the locking done at an outer level, and _not_ do it at the accessor level. So I suspect this is a completely separate issue (ie even an accessor doesn't make the "hugely painful" go away). And probably not worth worrying about *unless* somebody decides that they really really care about the race. That said, one thing that *could* help is if people decide that the right format for inode times is to just have one 64-bit word that has "sufficient resolution". That's what we did for "kernel time", ie "ktime_t" is a 64-bit nanosecond count, and by being just a single value, it avoids not just the horrible padding with 'struct timespec64', it is also dense _and_ can be accessed as one atomic value. Sadly, that "sufficient resolution" couldn't be nanoseconds, because 64-bit nanoseconds isn't enough of a spread. It's fine for the kernel time, because 2**63 nanoseconds is 292 years, so it moved the "year 2038" problem to "year 2262". And that's ok when we're talking about times that are kernel running times and we haev a couple of centuries to say "ok, we'll need to make it be a bigger type", but when you save the values to disk, things are different. I suspect filesystem people are *not* willing to deal with a "year 2262" issue. But if we were to say that "a tenth of microsecond resolution is sufficient for inode timestamps", then suddenly 64 bits is *enormous*. So we could do a // tenth of a microseconds since Jan 1, 1970 typedef s64 fstime_t; and have a nice dense timestamp format with reasonable - but not nanosecond - accuracy. Now that 292 year range has become 29,247 years, and filesystem people *might* find the "year-31k" problem acceptable. I happen to think that "100ns timestamp resolution on files is sufficient" is a very reasonable statement, but I suspect that we'll still find lots of people who say "that's completely unacceptable" both to that resolution, and to the 31k-year problem. But wouldn't it be nice to have just one single "fstime_t" for file timestamps, the same way we have "ktime_t" for CPU timestamps? Then we'd save even more space in the 'struct inode'.... Linus