Re: [PATCH v1] generic/476: requires 27GB scratch size

"Theodore Ts'o" <tytso@xxxxxxx> · Thu, 21 Jul 2022 07:42:47 -0400

On Thu, Jul 21, 2022 at 03:26:05PM +0800, Boyang Xue wrote:
> > > I find generic/476 easily goes into an infinite run on top of NFS. When it
> >
> > Infinite?  It's only supposed to start 25000*nr_cpus*TIME_FACTOR
> > operations, so it /should/ conclude eventually.  That includes driving
> > the filesystem completely out of space, but there ought to be enough
> > unlink/rmdir/truncate calls to free up space every now and then...
> 
> Yes. I'm not sure about the calculations inside, but when the size of
> the scratch device is < 27GB (can be 26GB when the backing storage is
> ext4 rather than xfs), the test runs infinitely. I'm aware that the
> test should be slow, especially on NFS, but I see the test never
> finishes, even after multiple days. This problem happens in both
> localhost-exported and remote-exported NFS configurations.

I can partially confirm this.  I had noted a few weeks ago that I
needed to exclude generic/476 or the test VM would hang for over 24
hours, at which point I lost patience and terminated the VM.  I had
gotten as far as

	gce-xfstests -c nfs -g auto -X generic/476

(which is a loopback config) using 5.19-rc4 in order to get a test run
to complete.

Note: this was also triggering failures of generic/426 and
generic/551, which I also haven't had time to investigate, not being
an NFS developer.  :-)

I wasn't sure whether generic/476 never terminating was caused by a
loopback-triggered deadlock, or something else.  But it sounds like
you've isolated it to the scratch device being *too* small, and that
the failure occurred even on a configuration where the client and
server were on different machines/VMs, correct?

> > >  _require_scratch
> > > +_require_scratch_size $((27 * 1024 * 1024)) # 27GB
> >
> > ...so IDGI, this test works as intended.  Are you saying that NFS
> > command overhead is so high that this test takes too long?

I interpreted this as "if the drive is too small, we're hitting some
kind of problem".  This *could* be some kind of problem which triggers
on ENOSPC; perhaps it's just much more likely on a smaller device?  So
it's possible this is not a test bug, but an NFS problem.  Perhaps we
should forward this off to the NFS folks first?
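
For reference, here is a rough sketch of the arithmetic behind the two
figures quoted above.  It assumes _require_scratch_size takes its
argument in KiB, which is what the "# 27GB" comment in the patch
implies.  The nr_cpus and TIME_FACTOR values are example numbers picked
for illustration, not taken from anyone's test setup.

```shell
#!/bin/bash
# Sketch only: the arithmetic behind the numbers quoted in this thread.

# Size passed to _require_scratch_size in the patch.
# Assumption: the helper takes its argument in KiB.
scratch_kb=$((27 * 1024 * 1024))
scratch_bytes=$((scratch_kb * 1024))
echo "scratch size: ${scratch_kb} KiB = ${scratch_bytes} bytes"

# Operation count per the quoted formula 25000*nr_cpus*TIME_FACTOR.
# Both values below are hypothetical examples.
nr_cpus=4
TIME_FACTOR=1
nr_ops=$((25000 * nr_cpus * TIME_FACTOR))
echo "operations: ${nr_ops}"
```

So the operation count is bounded and does not depend on the scratch
size at all, which is part of why a hard size threshold looks
suspicious to me.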

	   		     	   	  - Ted