Too many ENOSPC errors

Chris Perl <cperl@xxxxxxxxxxxxxx> · Thu, 8 Jun 2023 13:05:10 -0400

Hi everyone,

I'm working with several Red Hat derived systems and have noticed an
issue with ENOSPC and NFS that I'm looking for some guidance on.

First let me describe the testing setup, and then I'll share my
results from an EL7 based system (kernel 3.10.0-1160.90.1.el7), an EL8
based system (kernel 4.18.0-425.19.2.el8_7), an EL8 based system
patched with commit e6005436f6cc9ed13288f936903f0151e5543485 (kernel
4.18.0-425.19.2.el8_7 plus that commit), and finally an EL8 based
system but with an upstream 6.1 kernel.

Assume I have a 20M quota on my current working directory which is an
NFS export from one of the major enterprise vendors.

The testing looks like the following:

# rm -f file1
# touch file1
# dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
20+0 records in
20+0 records out
20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
# rm -f file2
# tee -a file1 <<< abc
abc

On an EL7 based system, running the above works just as shown. I.e.
you create file1, then use all the quota with file2, attempt to write
to file1 which fails with ENOSPC (as expected), remove file2 (which
frees up the quota), and then attempt to write to file1 again which
succeeds.

However, on a stock EL8 based system, I instead get the following
surprising behavior:

# rm -f file1
# touch file1
# dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
20+0 records in
20+0 records out
20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
# rm -f file2
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
# tee -a file1 <<< abc
abc
tee: file1: No space left on device

I.e. Even after freeing the space by removing file2, writing to file1
continues to fail with ENOSPC forever (I've only shown 2 iterations
above) [1]. No amount of waiting will cause it to go away. But, we've
found that running sync(1) on the file will fix it (the sync itself
will complain with ENOSPC, but then subsequent tee invocatinos
succeed).

I thought that perhaps the issue was the fact that kernel
4.18.0-425.19.2.el8_7 was missing commit
e6005436f6cc9ed13288f936903f0151e5543485 (which adds some ENOSPC
handling to `nfs_file_write'), so we patched the kernel with that
patch and tested again. It's worth saying that with this patch, the
behavior of our 4.18 kernel and the 6.1 kernel are consistent when
running this test, but I feel like there might still be a bug here.

What I get now is:

# rm -f file1
# touch file1
# dd bs=1M count=20 if=/dev/zero of=file2 # this will use all the quota
20+0 records in
20+0 records out
20971520 bytes (21 MB, 20 MiB) copied, 0.193018 s, 109 MB/s
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
# rm -f file2
# tee -a file1 <<< abc
abc
tee: file1: No space left on device
# tee -a file1 <<< abc
abc

I.e. The first attempt to write to the file after freeing the quota
fails with ENOSPC, but subsequent attempts succeed. Note that this is
different from the original behavior on our EL7 based system as shown
above where as soon as the quota is freed up, there are no more ENOSPC
errors.

I'm no expert, but below I'm including some digging I did in case it's
helpful for understanding the situation more fully without needing to
reproduce yourselves. If it's not helpful and just distracting,
apologies in advance!

>From strace'ing and systemtap'ing I noticed that the first call to
`tee' (after the quota is used up by file2) returns the ENOSPC in
response to close(2) (i.e. via `nfs_file_flush') and the second call
to `tee' (after the quota has been freed) returns the ENOSPC in
response to the write(2) (i.e. via `nfs_file_write' , and then clears
the error via the changes we introduced with commit
e6005436f6cc9ed13288f936903f0151e5543485).

Here is the output of [2], run while reproducing (the comments and
spacing are mine just to more easily be able to tell things apart):

# stap -vv --vp 10101 /tmp/t.stp
...
# This section is the first tee invocation when the quota is fully
consumed by file2
1686164924543967263: tee
module("nfs").function("nfs_file_write@fs/nfs/file.c:592").call
f_wb_err: wb_err: flags:0x0
1686164924543983309: tee
module("nfs").function("nfs_file_write@fs/nfs/file.c:592").return
f_wb_err: wb_err: flags:0x0
1686164924543986715: tee
module("nfs").function("nfs_file_flush@fs/nfs/file.c:139").call
f_wb_err: wb_err: flags:0x0
1686164924545701586: tee
module("nfs").function("nfs_file_flush@fs/nfs/file.c:139").return
f_wb_err: wb_err:ENOSPC flags:0x0

# This section is the second tee invocation when the quota has been freed
1686164933146193798: tee
module("nfs").function("nfs_file_write@fs/nfs/file.c:592").call
f_wb_err: wb_err:ENOSPC flags:0x0
1686164933147496167: tee
module("nfs").function("nfs_file_write@fs/nfs/file.c:592").return
f_wb_err: wb_err: flags:0x0
1686164933147623303: tee
module("nfs").function("nfs_file_flush@fs/nfs/file.c:139").call
f_wb_err: wb_err: flags:0x0
1686164933147627755: tee
module("nfs").function("nfs_file_flush@fs/nfs/file.c:139").return
f_wb_err: wb_err: flags:0x0

# This section is the third tee invocation that succeeds with no errors
1686164937358310484: tee
module("nfs").function("nfs_file_write@fs/nfs/file.c:592").call
f_wb_err: wb_err: flags:0x0
1686164937358323369: tee
module("nfs").function("nfs_file_write@fs/nfs/file.c:592").return
f_wb_err: wb_err: flags:0x0
1686164937358326649: tee
module("nfs").function("nfs_file_flush@fs/nfs/file.c:139").call
f_wb_err: wb_err: flags:0x0
1686164937359877744: tee
module("nfs").function("nfs_file_flush@fs/nfs/file.c:139").return
f_wb_err: wb_err: flags:0x0

I barely know what I'm looking at, but I wondered whether
`nfs_file_flush' should be resetting the error when it returns by
calling `file_check_and_advance_wb_err' instead of
`filemap_check_wb_err'. Given that it's reported the error to user
space, is that sufficient to clear the error? Or, does it need to keep
the error because there are dirty pages that haven't been written back
yet?

I guess what I'm asking is, is the above behavior we've observed on
EL8 patched with e6005436f6cc9ed13288f936903f0151e5543485 and
separately with 6.1 the expected behavior? Or should it behave like it
did in EL7?

Any insight or direction would be greatly appreciated!

Chris

[1]: It's not actually the write(2) that is giving the ENOSPC, it's
the close(2) and in fact the bytes successfully wind up in the file
even though the command appears to fail.

[2]: A systemtap script I threw together to look at fields I thought
might come into play:

function dump(file:long) {
        printf("%d: %s %70s f_wb_err:%6s wb_err:%6s flags:%#x\n",
                   gettimeofday_ns(),
                   execname(),
                   pp(),
                   errno_str(@cast(file, "struct file")->f_wb_err),
                   errno_str(@cast(file, "struct file")->f_mapping->wb_err),
                   @cast(file, "struct file")->f_mapping->flags);
}

probe module("nfs").function("nfs_file_write").call {
        if (!isinstr(execname(), "tee"))
                next
        dump(@cast($iocb, "struct kiocb")->ki_filp);
}

probe module("nfs").function("nfs_file_write").return {
        if (!isinstr(execname(), "tee"))
                next
        dump(@cast(@entry($iocb), "struct kiocb")->ki_filp);
}

probe module("nfs").function("nfs_file_flush").call {
        if (!isinstr(execname(), "tee"))
                next
        dump($file);
}

probe module("nfs").function("nfs_file_flush").return {
        if (!isinstr(execname(), "tee"))
                next
        dump(@entry($file));
}