On Wed, Mar 13, 2024 at 03:20:52AM -0400, Jeff King wrote:
> +cc Patrick for reftable
>
> On Tue, Mar 12, 2024 at 11:10:29AM -0400, Chuck Lever wrote:
>
> > Unit test t0032 fails when run on an NFS mount:
> >
> > [vagrant@cel t]$ ./t0032-reftable-unittest.sh
> > not ok 1 - unittests
> > #
> > # TMPDIR=$(pwd) && export TMPDIR &&
> > # test-tool reftable
> > #
> > # failed 1 among 1 test(s)
> > 1..1
>
> The output for this test script is particularly unhelpful because it's
> not using our test harness at all, but just running a bunch of internal
> tests using a single program.
>
> Running with "-v" should give more details about what's failing.
>
> I set up a basic loopback server like:
>
>   mkdir /mnt/{server,client}
>   exportfs -o rw,sync 127.0.0.1:/mnt/server
>   mount -t nfs 127.0.0.1:/mnt/server /mnt/client
>
> and then ran:
>
>   ./t0032-reftable-unittest.sh --root=/mnt/client -v
>
> Looks like it fails at:
>
>   running test_reftable_stack_compaction_concurrent_clean
>   reftable/stack_test.c: 1063: failed assertion count_dir_entries(dir) == 2
>   Aborted
>
> > v2.43.2 seems to work OK.
>
> For me, too. Bisecting shows the problem appearing in 4f36b8597c
> (reftable/stack: fix race in up-to-date check, 2024-01-18).

I think this is actually benign. I set a breakpoint in the respective
test right before double-checking our conditions, and curiously I got
back the following list of files:

  ./stack_test-1027.QJBpnd
  ./stack_test-1027.QJBpnd/0x000000000001-0x000000000003-dad7ac80.ref
  ./stack_test-1027.QJBpnd/.nfs000000000001729f00001e11
  ./stack_test-1027.QJBpnd/tables.list

Notice the ".nfs*" thing? This is a temporary file managed by the NFS
client that maintains delete-on-close behaviour because we have unlinked
the file while it was still open [1]. But of course we count that file
when executing `count_dir_entries()`, and thus we arrive at an
unexpected number of files.

I will send a patch to fix the test.

> PS That test seems to run ~20x slower on NFS versus directly on ext4.
> I'd expect a little overhead, but that's quite a bit.

I'm not all that surprised here given that the reftable library is quite
prone to stat(3P)ing the "tables.list" file, and potentially re-reading
it. I kind of suspect that this is what's going on. An alternative
explanation might be that mmap'ing over NFS is really slow.

Anyway, I will have a deeper look at this and see where we spend all the
time.

Patrick

[1]: https://nfs.sourceforge.net/#faq_d2
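
For illustration only, here is a minimal sketch of how a directory count
could ignore the ".nfs*" placeholders described above. This is not the
actual patch. Both helper names (`is_nfs_placeholder()` and
`count_visible_entries()`) are made up for this example, and it assumes
the NFS client always names its placeholders with a ".nfs" prefix, as in
the listing above.

```c
#include <dirent.h>
#include <string.h>

/*
 * Hypothetical helper: returns non-zero for the ".nfs*" placeholders
 * that an NFS client leaves behind when a file is unlinked while it
 * is still open. Assumes the ".nfs" prefix seen in the listing.
 */
static int is_nfs_placeholder(const char *name)
{
	return strncmp(name, ".nfs", 4) == 0;
}

/*
 * Hypothetical helper: counts the entries in `dirname`, skipping ".",
 * ".." and any ".nfs*" placeholder. Returns -1 if the directory
 * cannot be opened.
 */
static int count_visible_entries(const char *dirname)
{
	DIR *dir = opendir(dirname);
	struct dirent *d;
	int count = 0;

	if (!dir)
		return -1;

	while ((d = readdir(dir)) != NULL) {
		if (!strcmp(d->d_name, ".") || !strcmp(d->d_name, ".."))
			continue;
		if (is_nfs_placeholder(d->d_name))
			continue;
		count++;
	}

	closedir(dir);
	return count;
}
```

With the four paths listed above (the directory itself plus three
entries), such a count would see only the ".ref" table and
"tables.list", which is the 2 that the assertion expects.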