Re: [QUESTION] zig build system fails on XFS V4 volumes

On Mon, Feb 05, 2024 at 02:12:43PM +0100, Donald Buczek wrote:
> On 2/4/24 22:56, Dave Chinner wrote:
> > On Sat, Feb 03, 2024 at 06:50:31PM +0100, Donald Buczek wrote:
> >> Dear Experts,
> >>
> >> I'm encountering consistent build failures with the Zig
> >> language from source on certain systems, and I'm seeking
> >> insights into the issue.
> >>
> >> Issue Summary:
> >>
> >>     Build fails on XFS volumes with V4 format (crc=0).  Build
> >>     succeeds on XFS volumes with V5 format (crc=1), regardless
> >>     of bigtime value.
> > 
> > mkfs.xfs output for a successful build vs a broken build,
> > please!
> > 
> > Also a description of the hardware and storage stack
> > configuration would be useful.
> > 
> >>
> >> Observations:
> >>
> >>     The failure occurs silently during Zig's native build
> >>     process.
> > 
> > What is the actual failure? What are the symptoms of this "silent
> > failure"? Please give output showing how the failure occurs,
> > how it is detected, etc. From there we can work to identify what
> > to look at next.
> > 
> > Everything remaining in the bug report is pure speculation, but
> > there's no information provided that allows us to do anything
> > other than speculate in return, so I'm just going to ignore it.
> > Document the evidence of the problem so we can understand it -
> > speculation about causes in the absence of evidence is simply
> > not helpful....
> 
> I was actually just hoping that someone could confirm that the
> functionality, as visible from userspace, should be identical,
> apart from timing. Or, that someone might have an idea based on
> experience what could be causing the different behavior. This was
> not intended as a bug report for XFS.

Maybe not, but as a report of "weird unexpected behaviour on XFS"
it could be an XFS issue....

[....]

> There is also a script cmp.sh and its output cmp.log, which
> compares the xfs_ok and xfs_fail directories. It also produces
> traces.cmp.txt which is a (width 200) side by side comparison of
> the strace files.

I think this one contains a smoking gun w.r.t. whatever code is
running. Near the end of the first trace comparison, there is an
iteration of test/cases via getdents64(). The two traces show
different behaviour, yet the directory structure is the same.

Good:

openat(3, "test/cases", O_RDONLY|O_CLOEXEC|O_DIRECTORY) = 7
lseek(7, 0, SEEK_SET)                   = 0
getdents64(7, 0x7f8d106b69b8 /* 21 entries */, 1024) = 1000
getdents64(7, 0x7f8d106b69b8 /* 21 entries */, 1024) = 1016
getdents64(7, 0x7f8d106b69b8 /* 21 entries */, 1024) = 1000
getdents64(7, 0x7f8d106b69b8 /* 23 entries */, 1024) = 1016
openat(7, "compile_errors", O_RDONLY|O_CLOEXEC|O_DIRECTORY) = 8
getdents64(8, 0x7f8d106b6de0 /* 16 entries */, 1024) = 968
getdents64(8, 0x7f8d106b6de0 /* 17 entries */, 1024) = 1008
getdents64(8, 0x7f8d106b6de0 /* 17 entries */, 1024) = 1016
getdents64(8, 0x7f8d106b6de0 /* 14 entries */, 1024) = 968
openat(8, "async", O_RDONLY|O_CLOEXEC|O_DIRECTORY) = 9
getdents64(9, 0x7f8d106b7208 /* 16 entries */, 1024) = 1000
......


Bad:

openat(3, "test/cases", O_RDONLY|O_CLOEXEC|O_DIRECTORY) = 7
lseek(7, 0, SEEK_SET)                   = 0
getdents64(7, 0x7f2593eb89b8 /* 21 entries */, 1024) = 1000
getdents64(7, 0x7f2593eb89b8 /* 21 entries */, 1024) = 1016
getdents64(7, 0x7f2593eb89b8 /* 21 entries */, 1024) = 1000
getdents64(7, 0x7f2593eb89b8 /* 23 entries */, 1024) = 1016
getdents64(7, 0x7f2593eb89b8 /* 25 entries */, 1024) = 1016
getdents64(7, 0x7f2593eb89b8 /* 20 entries */, 1024) = 1016
getdents64(7, 0x7f2593eb89b8 /* 19 entries */, 1024) = 992
getdents64(7, 0x7f2593eb89b8 /* 22 entries */, 1024) = 1016
getdents64(7, 0x7f2593eb89b8 /* 22 entries */, 1024) = 992
getdents64(7, 0x7f2593eb89b8 /* 17 entries */, 1024) = 760
getdents64(7, 0x7f2593eb89b8 /* 0 entries */, 1024) = 0

In the good case, we see test/cases being read, then the first
subdir test/cases/compile_errors being opened and read, and then
its subdir test/cases/compile_errors/async being opened and read.

IOWs, in the good case it's doing a depth-first directory traversal.

In the bad case, no subdirectories are being opened and read at all.

I see the same difference in other traces that involve directory
traversal.

The reason for this difference seems obvious: there's a distinct
lack of stat() calls in the ftype=0 (bad) case. dirent->d_type in
this situation will be reporting DT_UNKNOWN for all entries except
'.' and '..'. It is the application's responsibility to handle this,
as the only way to determine if a DT_UNKNOWN entry is a directory is
to stat() the pathname and look at the st_mode returned.
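In rough terms, the fallback the application needs looks like this
(a minimal sketch using readdir()/fstatat() rather than raw
getdents64(); illustrative only, not zig's actual code):

	/* Sketch: classify a directory entry when the filesystem
	 * (here XFS v4, ftype=0) fills d_type with DT_UNKNOWN.
	 */
	#include <dirent.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/stat.h>

	/* Return 1 if the entry names a directory, 0 otherwise. */
	static int entry_is_dir(int dfd, const struct dirent *de)
	{
		struct stat st;

		if (de->d_type == DT_DIR)
			return 1;
		if (de->d_type != DT_UNKNOWN)
			return 0;

		/* d_type not provided - fall back to stat() and st_mode */
		if (fstatat(dfd, de->d_name, &st, AT_SYMLINK_NOFOLLOW) < 0)
			return 0;
		return S_ISDIR(st.st_mode);
	}

	int main(int argc, char **argv)
	{
		DIR *dir = opendir(argc > 1 ? argv[1] : ".");
		struct dirent *de;

		if (!dir)
			return 1;
		while ((de = readdir(dir)) != NULL)
			printf("%s%s\n", de->d_name,
			       entry_is_dir(dirfd(dir), de) ? "/" : "");
		closedir(dir);
		return 0;
	}

This is effectively what nftw() or fts(3) would do for you; a
hand-rolled directory walker has to do it itself.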

The code is clearly not doing this, and so I'm guessing that the zig
people have rolled their own nftw() function and didn't pay
attention to the getdents() man page:

	Currently,  only some filesystems (among them: Btrfs, ext2,
	ext3, and ext4) have full support for returning the file
	type in d_type.  All applications must properly handle a
	return of DT_UNKNOWN.

So, yeah, looks like someone didn't read the getdents man page
completely and it's not a filesystem issue.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



