On Tue, Sep 10, 2024 at 5:58 PM Bernd Schubert <bernd.schubert@xxxxxxxxxxx> wrote: > > I have noticed Go-FUSE test failures of late, that seem to originate > > in (changed?) kernel behavior. The problems looks like this: > > > > 12:54:13.385435 rx 20: OPENDIR n1 p330882 > > 12:54:13.385514 tx 20: OK, {Fh 1 } > > 12:54:13.385838 rx 22: READDIRPLUS n1 {Fh 1 [0 +4096) L 0 LARGEFILE} p330882 > > 12:54:13.385844 rx 23: INTERRUPT n0 {ix 22} p0 > > 12:54:13.386114 tx 22: OK, 4000b data "\x02\x00\x00\x00\x00\x00\x00\x00"... > > 12:54:13.386642 rx 24: READDIRPLUS n1 {Fh 1 [1 +4096) L 0 LARGEFILE} p330882 > > 12:54:13.386849 tx 24: 95=operation not supported > > > > As you can see, the kernel attempts to interrupt the READDIRPLUS > > do you where the interrupt comes from? Is your test interrupting > interrupting readdir? I did not write code to issue interrupts, but it is possible that the Go runtime does something behind my back. The debug output lists "p0"; is there a reason that the INTERRUPT opcode does not provide an originating PID ? How would I discover who or what is generating the interrupts? Will they show up if I run strace on the test binary? I straced the test binary, below is a section where an interrupt happens just before a directory seek. Could tgkill(SIGURG) cause an interrupt? It happens just before the READDIRPLUS op (',') and the INTERRUPT operation ('$') are read. [pid 371933] writev(10, [{iov_base="\260\17\0\0\0\0\0\0\20\0\0\0\0\0\0\0", iov_len=16}, {iov_base="\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\ 0\0\0"..., iov_len=4000}], 2 <unfinished ...> [pid 371931] <... nanosleep resumed>NULL) = 0 [pid 371933] <... writev resumed>) = 4016 [pid 371931] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...> [pid 371933] read(10, <unfinished ...> [pid 371934] <... getdents64 resumed>0xc0002ee000 /* 25 entries */, 8192) = 800 [pid 371931] <... nanosleep resumed>NULL) = 0 [pid 371934] futex(0xc000059148, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 371931] getpid( <unfinished ...> [pid 371934] <... futex resumed>) = 1 [pid 371932] <... futex resumed>) = 0 [pid 371931] <... getpid resumed>) = 371930 [pid 371934] getdents64(7, <unfinished ...> [pid 371932] futex(0xc000059148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> [pid 371931] tgkill(371930, 371934, SIGURG <unfinished ...> [pid 371935] <... read resumed>"P\0\0\0,\0\0\0\22\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131200) = 80 # , = readdirplus [pid 371933] <... read resumed>"0\0\0\0$\0\0\0\23\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131200) = 48 # $ = interrupt. [pid 371931] <... tgkill resumed>) = 0 [pid 371933] futex(0xc000059148, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 371931] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...> [pid 371933] <... futex resumed>) = 1 [pid 371932] <... futex resumed>) = 0 [pid 371933] write(2, "19:24:40.813294 doInterrupt\n", 28 <unfinished ...> 19:24:40.813294 doInterrupt A bit of browsing through the Go source code suggests that SIGURG is used to preempt long-running goroutines, so it could be issued more or less at random. Nevertheless, FUSE should also not be reissuing the reads, even if there were interrupts, right? > > operation, but go-fuse ignores the interrupt and returns 25 entries. > > The kernel somehow thinks that only 1 entry was consumed, and issues > > the next READDIRPLUS at offset 1. If go-fuse ignores the faulty offset > > and continues the listing (ie. continuing with entry 25), the test > > passes. > > > > Is this behavior of the kernel expected or a bug? > > > > I am redoing the API for directory listing to support cacheable and > > seekable directories, and in the new version, this looks like a > > directory seek. If the file system does not support seekable > > directories, I must return some kind of error (which is the ENOTSUP > > you can see in the log above). > > Is this with or without FOPEN_CACHE_DIR? Would be helpful to know > if FOPEN_CACHE_DIR - fuse kernel code is quite different when this > is set. without FOPEN_CACHE_DIR. > > I started seeing this after upgrading to Fedora 40. My kernel is > > 6.10.7-200.fc40.x86_64 > > > > Would be interesting to know your kernel version before? There is > commit cdf6ac2a03d2, which removes a readdir lock. Although the > commit message explains why it is not needed anymore. The failure happens in single threaded loads, so a race condition seems unlikely. Let me try with other kernel versions. Also, I realize that Fedora 40 also upgraded the Go compiler, and may have made SIGURG be triggered more frequently? -- Han-Wen Nienhuys - hanwenn@xxxxxxxxx - http://www.xs4all.nl/~hanwen