Re: [GIT PULL] xfs: new code for 6.7

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Nov 08, 2023 at 01:29:16PM -0800, Linus Torvalds wrote:
> On Wed, 8 Nov 2023 at 02:19, Chandan Babu R <chandanbabu@xxxxxxxxxx> wrote:
> >
> > I had performed a test merge with latest contents of torvalds/linux.git.
> >
> > This resulted in merge conflicts. The following diff should resolve the merge
> > conflicts.
> 
> Well, your merge conflict resolution is the same as my initial
> mindless one, but then when I look closer at it, it turns out that
> it's wrong.
> 
> It's wrong not because the merge itself would be wrong, but because
> the conflict made me look at the original, and it turns out that
> commit 75d1e312bbbd ("xfs: convert to new timestamp accessors") was
> buggy.
> 
> I'm actually surprised the compilers don't complain about it, because
> the bug means that the new
> 
>         struct timespec64 ts;
> 
> temporary isn't actually initialized for the !XFS_DIFLAG_NEWRTBM case.
> 
> The code does
> 
>   xfs_rtpick_extent(..)

Oh gosh.  Dave might have other things to say, but xfs_rtpick_extent is
the sort of function that I hate with the power of 1,000 suns.

Back in 2.6.x it apparently did this:

	seqp = (__uint64_t *)&mp->m_rbmip->i_d.di_atime;

At the time, xfs_inode.id.di_atime was a struct xfs_ictimestamp:

typedef struct xfs_ictimestamp {
	__int32_t	t_sec;		/* timestamp seconds */
	__int32_t	t_nsec;		/* timestamp nanoseconds */
} xfs_ictimestamp_t;

So the rt allocator thinks its maintaining a u64 new file counter in the
bitmap file's atime.  The lower 32bits ended up in t_sec, and the upper
32bits ended up in t_nsec.

At some point (4.6?) the function started using the VFS i_atime field
instead of the di_atime field.  On 32-bit systems the struct timespec
was still a struct of two int32 values and everything kept working the
way it always had.

On 64-bit systems, tv_sec is 64-bits which means the sequence counter
was only stored (incore, anyway) in tv_sec.  XFS truncates the upper
32-bits when writing the inode to disk because (at the time) it didn't
handle y2038.

IOWs, we broke the ondisk format in 2016 and nobody noticed.  Because
the allocator calls xfs_highbit64 on the sequence counter, the only
observable behavior change would be the starting location of a free
space search for the first rt file allocation after an upgrade from 4.5
to a newer kernel on a 64-bit machine.

(Or going back, obviously)

But then in 4.18 or so, the VFS inode switched to timespec64, at which
point /all/ of the 32-bit kernels "migrated" to storing the sequence
counter in tv_sec and truncating it when it goes out to disk.

Then in 5.10 we added y2038 support, so post-Covid filesystems truncate
less of the sequence counter.  #winning

>   ...
>         struct timespec64 ts;
>         ..
>         if (!(mp->m_rbmip->i_diflags & XFS_DIFLAG_NEWRTBM)) {
>                 mp->m_rbmip->i_diflags |= XFS_DIFLAG_NEWRTBM;
>                 seq = 0;
>         } else {
>         ...
>         ts.tv_sec = (time64_t)seq + 1;
>         inode_set_atime_to_ts(VFS_I(mp->m_rbmip), ts);

So... according to the pre-4.6 definition of the sequence counter this
is wrong, but OTOH it's not inconsistent with what was there in 6.4.

> and notice how 'ts.tv_nsec' is never initialized. So we'll set the
> nsec part of the atime to random garbage.
> 
> Oh, I'm sure it doesn't really *matter*, but it's most certainly wrong.

tv_nsec isn't explicitly initialized by rtpick_extent, but IIRC mkfs
initializes the ondisk inode's tv_nsec field and the kernel reads that
into the incore inode, so I dont't think it's leaking kernel memory
contents.

> I am not very happy about the whole crazy XFS model where people cast
> the 'struct timespec64' pointer to an 'uint64_t' pointer, and then say
> 'now it's a sequence number'. This is not the only place that
> happened, ie we have similar disgusting code in at least
> xfs_rtfree_extent() too.
> 
> That other place in xfs_rtfree_extent() didn't have this bug - it does
> inode_get_atime() unconditionally and this keeps the nsec field as-is,
> but that other place has the same really ugly code.
> 
> Doing that "cast struct timespec64 to an uint64_t' is not only ugly
> and wrong, it's _stupid_. The only reason it works in the first place
> is that 'struct timespec64' is
> 
>   struct timespec64 {
>         time64_t        tv_sec;                 /* seconds */
>         long            tv_nsec;                /* nanoseconds */
>   };
> 
> so the first field is 'tv_sec', which is a 64-bit (signed) value.

(yep)

> So the cast is disgusting - and it's pointless. I don't know why it's
> done that way. It would have been much cleaner to just use tv_sec, and
> have a big comment about it being used as a sequence number here.
> 
> I _assume_ there's just a simple 32-bit history to this all, where at
> one point it was a 32-bit tv_sec, and the cast basically used both
> 32-bit fields as a 64-bit sequence number.  I get it. But it's most
> definitely wrong now.

I don't even think it was good C back whenever it was written, but I was
probably in high school at that point. ;)

> End result: I ended up fixing that bug and removing the bogus casts in
> my merge. I *think* I got it right, but apologies in advance if I
> screwed up. I only did visual inspection and build testing, no actual
> real testing.

My opinion is that you've kept your tree consistent with what the kernel
has been doing for the last 5 years.  No comment about the s**tshow that
went on before that.

> Also, xfs people may obviously have other preferences for how to deal
> with the whole "now using tv_sec in the VFS inode as a 64-bit sequence
> number" thing, and maybe you prefer to then update my fix to this all.
> But that horrid casts certainly wasn't the right way to do it.

Yeah, I can work on that for the rt modernization patchset.

> Put another way: please do give my merge a closer look, and decide
> amongst yourself if you then want to deal with this some other way.

Let's see what the other devs say.  Thank you for taking Chandan's pull
request, by the way.

--D

> 
>               Linus




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux