On Wed, Nov 13, 2019 at 02:20:44PM +0100, Arnd Bergmann wrote: > On Wed, Nov 13, 2019 at 5:59 AM Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote: > > On Wed, Nov 13, 2019 at 06:42:09AM +0200, Amir Goldstein wrote: > > > [CC:Arnd,Deepa] > > > On Wed, Nov 13, 2019 at 5:56 AM Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote: > > > > > > > > On Wed, Nov 13, 2019 at 08:11:53AM +1100, Dave Chinner wrote: > > > > > On Tue, Nov 12, 2019 at 06:00:19AM +0200, Amir Goldstein wrote: > > > > > > On Tue, Nov 12, 2019 at 12:35 AM Darrick J. Wong > > > > > > <darrick.wong@xxxxxxxxxx> wrote: > > > > > > > > > > I didn't get why you dropped the di_pad idea (Arnd's and Dave's patches). > > > Is it to save the pad for another rainy day? > > > > Something like that. Someone /else/ can worry about the y2486 problem. > > > > <duck> > > > > Practically speaking I'd almost rather drop the precision in order to > > extend the seconds range, since timestamp updates are only precise to > > HZ anyway. > > I think there would be noticeable overhead from anything that requires > a 64-bit division to update a timestamp, or to read it into an in-memory > inode, especially on 32-bit architectures. Comapred to actually allocating the inode, initialising it, etc? The cost of a division is noise and, as such, is completely irrelevant. It's classic premature micro-optimisation of CPU usage without any regard for the context in which the division might occur. Seriously, XFS is not a filesystem for small, poorly performing 32 bit CPUs. It's a filesystem for 64 bit servers with hundreds of CPUs, terabytes of memory and storage capable of millions of IOPs. All I care about is that it /works/ on tiny systems and doesn't corrupt filesystems - we've already made so many compromises in terms of algorithmic scalability and memory usage that XFS performance does not scale down effectively to tiny machines and tiny filesystems. IOWs, if you have a slow, poorly implemented 32 bit CPU with limited RAM and storage capability, then XFS is not the filesystem for your hardware. And in that context, the cost of a 64bit integer division is largely irrelevant to us. > I think the reason why separate seconds/nanoseconds are easy is that > this is what both the in-kernel and user space interface are based around. That's also largely irrelevant to what we store on disk. We store lots of stuff in encoded data structures because it's more space efficient to do so. We trade off CPU time for space efficiency -everywhere- in the XFS filesystem, because CPU is far cheaper than the RAM to store it and cheaper than moving information to/from disk. > We clearly don't use the precision, but anything else is more expensive > at runtime. It's noise in runtime CPU profiles. It's far more important for the filesystem on-disk format to be sane, scalable, maintainable and repairable than it is to save a CPU cycle or two in a path that is several thousand CPU instructions long already.... > > > Is there any core xfs developer that was going to tackle this? > > > > > > I'm here, so if you need my help moving things forward let me know. > > > > I wrote a trivial garbage version this afternoon, will have something > > more polished tomorrow. None of this is 5.6 material, we have time. > > I think from a user perspective, it would be the nicest to just add the The *user* does not see or know anything to do with what is on disk, nor does the kernel code outside the disk->memory translation functions. What the users passes to/from the kernel is irrelevant when discussing how we store something on disk. Indeed, what the kernel passes to the filesystem is largely irrelevant, too :) > the existing behavior: setting a timestamp after 2038 using utimensat() > silently wraps the seconds back to the regular epoch. With the > extension patch, you get the correct results as long as the inode was > both written and read on a new enough kernel, while all pre-5.4 > kernels produce the same incorrect data that they always have. Feature bits prevent the new format being read on old kernels. We can't allow an old kernel to parse an structure that it will present to the user incorrectly because it was written by a more recent kernel. Changing timestamp formats means old kernels can no longer mount that filesystem, and that goes for your changes as well - an old kernel will silently mishandle a >2038 date, and may even detect it as corruption because we expect padding to be zero on unused on-disk fields. Worse, older kernels will overwrite the new epoch fields with zeros when the inode is re-written, destroying the >y2038 timestamp information that is in that padding. IOWs, old kernels do not preserve stuff written into on-disk strucutre padding - the expect it to be zero and will write zeros there whenever the structure is written to disk. That means these epoch based y2038k format changes are not forwards or backwards compatible - you can't interchange the filesystem between kernels that do/don't support y2038k timestamps and expect it to works correctly - feature bits are needed to prevent the timestamps from being mis-interpretted or silently corrupted. That means it's a one-way conversion. Hence if the old format wrapped back to <1970, then new value written to disk will encode that <1970 date to disk *in the new format*. Nothing gets lost or changed in the process, and it's clear that an old kernel cannot mangle it because it can no longer mount the filesystem that holds new format timestamps.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx