On Thu, Sep 21, 2023 at 10:02 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote: > > On Thu, 2023-09-21 at 11:24 -0700, Linus Torvalds wrote: > > On Thu, 21 Sept 2023 at 04:21, Christian Brauner <brauner@xxxxxxxxxx> wrote: > > > > > > git@xxxxxxxxxxxxxxxxxxx:pub/scm/linux/kernel/git/vfs/vfs tags/v6.6-rc3.vfs.ctime.revert > > > > So for some reason pr-tracker-bot doesn't seem to have reacted to this > > pull request, but it's in my tree now. > > > > I *do* have one reaction to all of this: now that you have made > > "i_ctime" be something that cannot be accessed directly (and renamed > > it to "__i_ctime"), would you mind horribly going all the way, and do > > the same for i_atime and i_mtime too? > > > > The reason I ask is that I *really* despise "struct timespec64" as a type. > > > > I despise it inherently, but I despise it even more when you really > > use it as another type entirely, and are hiding bits in there. > > > > I despise it because "tv_sec" obviously needs to be 64-bit, but then > > "tv_nsec" is this horrible abomination. It's defined as "long", which > > is all kinds of crazy. It's bogus and historical. > > > > And it's wasteful. > > > > Now, in the case of i_ctime, you took advantage of that waste by using > > one (of the potentially 2..34!) unused bits for that > > "fine-granularity" flag. > > > > But even when you do that, there's up to 33 other bits just lying > > around, wasting space in a very central data structure. > > > > So it would actually be much better to explode the 'struct timespec64' > > into explicit 64-bit seconds field, and an explicit 32-bit field with > > two bits reserved. And not even next to each other, because they don't > > pack well in general. > > > > So instead of > > > > struct timespec64 i_atime; > > struct timespec64 i_mtime; > > struct timespec64 __i_ctime; > > > > where that last one needs accessors to access, just make them *all* > > have helper accessors, and make it be > > > > u64 i_atime_sec; > > u64 i_mtime_sec; > > u64 i_ctime_sec; > > u32 i_atime_nsec; > > u32 i_mtime_nsec; > > u32 i_ctime_nsec; > > > > and now that 'struct inode' should shrink by 12 bytes. > > > > I like it. > > > Then do this: > > > > #define inode_time(x) \ > > (struct timespec64) { x##_sec, x##_nsec } > > > > and you can now create a timespec64 by just doing > > > > inode_time(inode->i_atime) > > > > or something like that (to help create those accessor functions). > > > > Now, I agree that 12 bytes in the disaster that is 'struct inode' is > > likely a drop in the ocean. We have tons of big things in there (ie > > several list_heads, a whole 'struct address_space' etc etc), so it's > > only twelve bytes out of hundreds. > > > > But dammit, that 'timespec64' really is ugly, and now that you're > > hiding bits in it and it's no longer *really* a 'timespec64', I feel > > like it would be better to do it right, and not mis-use a type that > > has other semantics, and has other problems. > > > > > We have many, many inodes though, and 12 bytes per adds up! > > I'm on board with the idea, but...that's likely to be as big a patch > series as the ctime overhaul was. In fact, it'll touch a lot of the same > code. I can take a stab at that in the near future though. > > Since we're on the subject...another thing that bothers me with all of > the timestamp handling is that we don't currently try to mitigate "torn > reads" across the two different words. It seems like you could fetch a > tv_sec value and then get a tv_nsec value that represents an entirely > different timestamp if there are stores between them. > > Should we be concerned about this? I suppose we could do something with > a seqlock, though I'd really want to avoid locking on the write side. As luck would have it, if my calculations are correct, on x86-64 and with CONFIG_FS_POSIX_ACL=y, CONFIG_SECURITY=y (as they are on distro kernels), __i_ctime is exactly on split cache lines and maybe even worse (?), __i_ctime.tv_nsec and the QUERIED bit are on the same cache line with i_lock :-/ If we reorder the inode timestamps with i_size, we improve the situation for this specific and very common configuration. Maybe. Thanks, Amir.