On Wed, Nov 13, 2019 at 08:11:53AM +1100, Dave Chinner wrote:
> On Tue, Nov 12, 2019 at 06:00:19AM +0200, Amir Goldstein wrote:
> > On Tue, Nov 12, 2019 at 12:35 AM Darrick J. Wong
> > <darrick.wong@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Nov 11, 2019 at 11:36:30PM +0200, Amir Goldstein wrote:
> > > > Use similar on-disk encoding for timestamps as ext4 to push back
> > > > the y2038 deadline to 2446.
> > > >
> > > > The encoding uses the 2 free MSB in 32bit nsec field to extend the
> > > > seconds field storage size to 34bit.
> > > >
> > > > Those 2 bits should be zero on existing xfs inodes, so the extended
> > > > timestamp range feature is declared read-only compatible with old
> > > > on-disk format.
> > >
> > > What do you think about making the timestamp field a uint64_t counting
> > > nanoseconds since Dec 14 09:15:53 UTC 1901 (a.k.a. the minimum datetime
> > > we support with the existing encoding scheme)? Instead of using the
> > > upper 2 bits of the nsec field for an epoch encoding, which ext4 screwed
> > > up years ago and has not fully fixed?
> >
> > The advantage of the ext4 scheme is that it is more backward compatible.
>
> Darrick and I had a long discussion about this on #xfs a few weeks
> ago (22nd October).
>
> Discussion went like this:
>
> <djwong> btw, dchinner, were one to try to solve the y2038 problem on xfs, how would one do it?
> <djwong> 1) write tests to make sure we can store/retrieve the extreme ends of the existing timestamps; then
> <djwong> 2) use empty di_pad bytes to extend each timestamp field width; then
> <djwong> 3) figure out what the values of the extra byte are (epochs moving forward from the unix epoch; and we don't care about supporting dates before 1900); then
> <djwong> 4) expand test to cover new intervals?
> <dchinner> djwong: pretty much
> <dchinner> the epoch extending patch I originally proposed is here: https://lore.kernel.org/linux-xfs/20140602002822.GQ14410@dastard/
> <djwong> also it occurred to me that one could use the top two bits of the nsec fields to make it a 10-bit extension of the seconds fields
> <dchinner> I've probably got a more recent version somewhere in a stack somewhere around here
> <dchinner> didn't ext4 use part of the nsec field like that?
> <djwong> yeah, they have 34 bit dates now
> <dchinner> ISTR it's got some horrifically complex encoding of the timestamp
> <djwong> yes, it does
> <djwong> they did epochs rolling forward from the current one
> <dchinner> we could just do the 34 bit second time in a sane way
> <dchinner> because all the timestamps are contiguous on disk
> <dchinner> i.e. if a SB flag is set, the timestamp on disk is an opaque 64 bit field
> <dchinner> upper 30 bits are the nsec field, lower 34 bits are the seconds field
> <dchinner> similar to how we encode BMBT extent records
> <dchinner> always unsigned, so we don't support dates before 1970 at all....
> <djwong> hmm, with that scheme (uint t_sec:34; uint t_nsec: 30;} I guess that gets us to the year 2514
> <djwong> and provided nobody cares about pre-1970 dates or the ability to store negative t_nsec
> <dchinner> djwong: if XFS is still in use in 2514, then I'm not going to care about it :)
> <djwong> [narrator] But what Dave doesn't know is that the three XFS maintainers will be uploaded into the Cloud in 2046 to maintain XFS in perpetuity.... :D
> <dchinner> The current Dave doesn't care about that :)
> <djwong> hmm even if we did {signed int t_sec:34;} that would still get us to 2242
> <dchinner> djwong: I just don't see the point of having signed timestamps
> <djwong> admittedly, we don't need timestamps dating back to the 1700s
> <djwong> but then, what if we set the new epoch to 1993?
> <djwong> (or, heck, 2000?)
> <djwong> i mean, i guess it doesn't matter if we have dates going to 2514 or 2544
> <dchinner> what, have an on-disk epoch that is different to the unix epoch?
> <djwong> yes :D
> <djwong> "In the year 2525, if XFS is still alive..." ♪♪
> <dchinner> then we definitely have unsigned timestamps on disk
> <dchinner> set the epoch to ~1900, and we handle the legacy negative 32-bit int timestamp range as well.
> <djwong> that could also work
> <djwong> I don't anticipate being around in 2444
>
> Basically, we've both looked at what ext4 has done and don't want to
> do that - it's an awful, complex hack and it's had quite a few bugs
> in it since it was introduced that went a long time before being
> noticed.

When that conversation happened, I was still thinking of using the top
34 bits for seconds and the bottom 30 bits for nsec, which is not where
my brain went this month. I've since moved on to a u64 nsec counter,
which gets us to 2486, which is a whole forty more years past ext4!!!

> > I would like to have an upgrade procedure that is simple and I don't like
> > the idea of having two completely different time encodings in the code
> > (and on disk) if I can help it.
>
> Backwards compatible in-place upgrades are largely a myth: we don't
> allow changes to the on-disk format without feature bits that limit
> what old kernels can do with new formats. In this case it requires
> an incompat bit because the moment an upper bit in the ns field is
> set then the timestamps go bad on old kernels. Hence it's not a
> compatible change and filesystems with this format cannot be mounted
> on kernels that don't support it.
>
> So, an in-place upgrade process is a one-way operation - once you
> start converting and have >2038 dates, there is no going back
> without an offline admin operation. That means there's no real need
> to try to retain the old format at all.
> IOWs, for in-place upgrade,
> all we need is an inode flag to indicate what format the timestamp
> is in once the superblock incompat flag is set. Eventually the SB
> flag becomes the mkfs default, and then eventually it becomes the
> only supported format. This is what we've done before for things
> like NLINK, ATTR2, etc.

/me is confused, are you advocating for an upgrade path that is: (1)
admin sets incompat feature; (2) as people try to store dates beyond
2038 we set an inode flag and write the timestamp in the new format?

I guess we could do that. I'd kinda thought that we'd just set an
incompat flag and users simply have to back up, reformat, and reinstall.
OTOH it's a fairly minor update so maybe we can support one-way upgrade.

> > IIUC, you are implying that the ext4 scheme is more prone to human
> > programming errors? that should be addressed with proper unit testing
> > IMO and besides, we can learn from ext4 past bugs (not sure that my
> > implementation did), so those could be listed also in the pros column.

Well... Ted added a comment to ext4 about how the encoding had been
screwed up, along with some #if'd out code that would some day take its
place and do the encoding correctly... but Christoph later ripped it out
since it's basically an incompat format change...

> We're not implying anything - there's been several actual bugs in
> the encoding scheme that weren't noticed or fixed for quite a long
> time. What we've learnt from this is that complexity in timestamp
> encoding only leads to bugs, so given the choice we should really do
> something simpler.

...so yes, let's try for something simpler.

> > One thing I wasn't certain about is that it seems that xfs (and xfs_repair)
> > allows for negative nsec value. Not sure if that is intentional and why.
> > I suppose it is an oversight? That is something that xfs_repair would
> > need to check and fix before upgrade if we do go with the epoch bits.
>
> It's not an oversight - it's something the on-disk format allowed.
> Who knows if it ever got used (or how it got used), but it's
> something we can only fix by changing the on-disk format (as you can
> see from the discussion above).

The disk format allows it; scrub warns about it, and the kernel at least
in theory clamps the nsec value to 0...1e9.

> IOWs, we pretty much decided on a new 64 bit encoding format using an
> epoch of 1900 and an unsigned 64bit nanosecond counter to get us a
> range of almost 600 years from year 1900-2500. It's simple, easy to
> encode/decode, and very easy to validate. It's also trivially easy
> to do in-place upgrades with an inode flag....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
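
For reference, here is a minimal sketch of the unsigned 64-bit
nanosecond counter discussed above. It assumes the on-disk epoch sits at
the bottom of the legacy signed 32-bit range (INT32_MIN seconds before
the Unix epoch, in December 1901). The names bigtime_encode,
bigtime_decode and BIGTIME_EPOCH_OFFSET are made up for illustration and
do not come from any posted patch; endian conversion and the superblock
and inode flag checks are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC		1000000000ULL

/*
 * Hypothetical on-disk epoch: seconds between the smallest legacy
 * timestamp (INT32_MIN, December 1901) and the Unix epoch.
 */
#define BIGTIME_EPOCH_OFFSET	2147483648LL

/* Largest whole number of seconds that fits in a u64 nanosecond counter. */
#define BIGTIME_MAX_DISK_SEC	(UINT64_MAX / NSEC_PER_SEC)

/*
 * Convert a (seconds since 1970, nanoseconds) pair to the on-disk
 * counter.  Returns false if the value cannot be represented, which
 * makes validation a plain range check.
 */
static inline bool
bigtime_encode(int64_t sec, uint32_t nsec, uint64_t *disk)
{
	uint64_t	s;

	if (nsec >= NSEC_PER_SEC)
		return false;
	if (sec < -BIGTIME_EPOCH_OFFSET)
		return false;
	if (sec > (int64_t)BIGTIME_MAX_DISK_SEC - BIGTIME_EPOCH_OFFSET)
		return false;

	s = (uint64_t)(sec + BIGTIME_EPOCH_OFFSET);

	/* The final second of the range is only partially usable. */
	if (s > (UINT64_MAX - nsec) / NSEC_PER_SEC)
		return false;

	*disk = s * NSEC_PER_SEC + nsec;
	return true;
}

/* Convert the on-disk counter back to (seconds since 1970, nanoseconds). */
static inline void
bigtime_decode(uint64_t disk, int64_t *sec, uint32_t *nsec)
{
	*sec = (int64_t)(disk / NSEC_PER_SEC) - BIGTIME_EPOCH_OFFSET;
	*nsec = (uint32_t)(disk % NSEC_PER_SEC);
}
```

Under those assumptions the counter runs out partway through 2486, which
lines up with the figure mentioned earlier in this reply, and a negative
nsec simply cannot be expressed.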