Re: [RFC PATCH] NFSD: Encode COMPOUND operation status on page boundaries

Chuck Lever <chuck.lever@xxxxxxxxxx> · Mon, 23 Dec 2024 09:31:39 -0500

On 12/22/24 8:25 PM, Rick Macklem wrote:
On Sun, Dec 22, 2024 at 4:10 PM <cel@xxxxxxxxxx> wrote:

From: Chuck Lever <chuck.lever@xxxxxxxxxx>

J. David reports an odd corruption of a READDIR reply sent to a
FreeBSD client.

xdr_reserve_space() has to do a special trick when the @nbytes value
requests more space than there is in the current page of the XDR
buffer.

In that case, xdr_reserve_space() returns a pointer to the start of
the next page, and then the next call to xdr_reserve_space() invokes
__xdr_commit_encode() to copy enough of the data item back into the
previous page to make that data item contiguous across the page
boundary.

But we need to be careful in the case where buffer space is reserved
early for a data item that will be inserted into the buffer later.

One such caller, nfsd4_encode_operation(), reserves 8 bytes in the
encoding buffer for each COMPOUND operation. However, a READDIR
result can sometimes encode file names so that there are only 4
bytes left at the end of the current XDR buffer page (though plenty
of pages are left to handle the remaining encoding tasks).

If a COMPOUND operation follows the READDIR result (say, a GETATTR),
then nfsd4_encode_operation() will reserve 8 bytes for the op number
(9) and the op status (usually NFS4_OK). In this weird case,
xdr_reserve_space() returns a pointer to byte zero of the next buffer
page, as it assumes the data item will be copied back into place (in
the previous page) on the next call to xdr_reserve_space().

nfsd4_encode_operation() writes the op num into the buffer, then
saves the next 4-byte location for the op's status code. The next
xdr_reserve_space() call is part of GETATTR encoding, so the op num
gets copied back into the previous page, but the saved location for
the op status continues to point to the wrong spot in the current
XDR buffer page because __xdr_commit_encode() moved that data item.

After GETATTR encoding is complete, nfsd4_encode_operation() writes
the op status over the first XDR data item in the GETATTR result.
The NFS4_OK status code (0) makes it look like there are zero items
in the GETATTR's attribute bitmask.
I can confirm that this patch fixes the test case I have, which is based on
J. David's reproducer.

May I add:

Tested-by: Rick Macklem <rick.macklem@xxxxxxxxx>

?


I also think the analysis sounds right. I had gotten to the point where
I thought it was some oddball case related to xdr_reserve_space() and
saw that the pointer used by GETATTR's nfsd4_encode_bitmap4() was
4 bytes into a page, for the broken case.

May I also add:

Reviewed-by: Rick Macklem <rick.macklem@xxxxxxxxx>

?


(As an aside, it appears that "%p" is broken for 32bit arches. I needed to
use "%x" with a (unsigned int) cast to printk() pointers. I doubt anyone  much
cares about 32bits any more, but I might look to see if I can fix it.)

On Linux, the %p formatter emits a hash instead of the actual pointer
address, for security reasons. There are ways to disable this -- see
the "no_hash_pointers" kernel command line argument for the big hammer.

(Documentation/admin-guide/kernel-parameters.txt)


Good work Chuck!

Thanks!


The patch description of commit 2825a7f90753 ("nfsd4: allow encoding
across page boundaries") [2014] remarks that NFSD "can't handle a
new operation starting close to the end of a page." This behavior
appears to be one reason for that remark.

Break up the reservation of the COMPOUND op num and op status data
items into two distinct 4-octet reservations. Thanks to XDR data
item alignment restrictions, a 4-octet buffer reservation can never
straddle a page boundary.
I don't know enough about the code to comment w.r.t. whether or not this
is a correct fix, although it seems to work for the test case.

The most correct fix, IMO, would be to use write_bytes_to_xdr_buf()
to insert the op status once the operation has been encoded. That
API does not depend on an ephemeral C pointer into a buffer.

The most performant fix is the one proposed here, and this one is
also likely to apply cleanly to older Linux kernels.


I will note that it is a pretty subtle trap and it would be nice if something
could be done to avoid this coding mistake from happening again.
All I can think of is defining a new function that must be used instead of
xdr_reserve_space() if there will be other encoding calls done before the
pointer is used. Something like xdr_reserve_u32_long_term(). (Chuck is
much better at naming stuff than I am.)
--> If your fix is correct, in general, this function would just call
xdr_reserve_space(xdr, XDR_UNIT),
       but it could also set a flag in the xdr structure so that it can
only be done once until
       the flag is cleared (by a function call when the code is done
with the pointer), or something like that?

Well there's no limit on the number of times you can call
xdr_reserve_space(XDR_UNIT) and save the result.

It just so happens that, with the current implementation of
xdr_reserve_space(), reserving up to 4 octets returns a pointer that
remains valid as long as the buffer exists. For larger sizes, that
doesn't happen to work -- the returned pointer is guaranteed to be valid
only until the next call to xdr_reserve_space() or xdr_commit_encode().

The only reason to use the "save the pointer" trick is because it is
more performant than write_bytes_to_xdr_buf() (go look at what that API
does to see why).

So the whole "save a pointer into the XDR buf, use it later" trick is
brittle and depends on an undocumented behavior of that API. At this
point, the least we can do is add a warning to reserve_space's kdoc
comment. The best we could do is come up with an API that performs
reasonably well and makes this common trope less brittle.

I'm auditing the code base for other places that might make unsafe
assumptions about the pointer returned from xdr_reserve_space().
nfsd4_encode_read() and read_plus() both do. Probably the SunRPC GSS-API
encoders do as well.


Anyhow, others probably can suggest better changes that would avoid this trap?

Good work, rick


Reported-by: J David <j.david.lists@xxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
---
  fs/nfsd/nfs4xdr.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 53fac037611c..8780da884197 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -5764,10 +5764,11 @@ nfsd4_encode_operation(struct nfsd4_compoundres *resp, struct nfsd4_op *op)
         nfsd4_enc encoder;
         __be32 *p;

-       p = xdr_reserve_space(xdr, 8);
+       if (xdr_stream_encode_u32(xdr, op->opnum) != XDR_UNIT)
+               goto release;
+       p = xdr_reserve_space(xdr, XDR_UNIT);
         if (!p)
                 goto release;
-       *p++ = cpu_to_be32(op->opnum);
         post_err_offset = xdr->buf->len;

         if (op->opnum == OP_ILLEGAL)
--
2.47.0



--
Chuck Lever