On Mon, Dec 23, 2024 at 6:31 AM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote: > > On 12/22/24 8:25 PM, Rick Macklem wrote: > > On Sun, Dec 22, 2024 at 4:10 PM <cel@xxxxxxxxxx> wrote: > >> > >> From: Chuck Lever <chuck.lever@xxxxxxxxxx> > >> > >> J. David reports an odd corruption of a READDIR reply sent to a > >> FreeBSD client. > >> > >> xdr_reserve_space() has to do a special trick when the @nbytes value > >> requests more space than there is in the current page of the XDR > >> buffer. > >> > >> In that case, xdr_reserve_space() returns a pointer to the start of > >> the next page, and then the next call to xdr_reserve_space() invokes > >> __xdr_commit_encode() to copy enough of the data item back into the > >> previous page to make that data item contiguous across the page > >> boundary. > >> > >> But we need to be careful in the case where buffer space is reserved > >> early for a data item that will be inserted into the buffer later. > >> > >> One such caller, nfsd4_encode_operation(), reserves 8 bytes in the > >> encoding buffer for each COMPOUND operation. However, a READDIR > >> result can sometimes encode file names so that there are only 4 > >> bytes left at the end of the current XDR buffer page (though plenty > >> of pages are left to handle the remaining encoding tasks). > >> > >> If a COMPOUND operation follows the READDIR result (say, a GETATTR), > >> then nfsd4_encode_operation() will reserve 8 bytes for the op number > >> (9) and the op status (usually NFS4_OK). In this weird case, > >> xdr_reserve_space() returns a pointer to byte zero of the next buffer > >> page, as it assumes the data item will be copied back into place (in > >> the previous page) on the next call to xdr_reserve_space(). > >> > >> nfsd4_encode_operation() writes the op num into the buffer, then > >> saves the next 4-byte location for the op's status code. The next > >> xdr_reserve_space() call is part of GETATTR encoding, so the op num > >> gets copied back into the previous page, but the saved location for > >> the op status continues to point to the wrong spot in the current > >> XDR buffer page because __xdr_commit_encode() moved that data item. > >> > >> After GETATTR encoding is complete, nfsd4_encode_operation() writes > >> the op status over the first XDR data item in the GETATTR result. > >> The NFS4_OK status code (0) makes it look like there are zero items > >> in the GETATTR's attribute bitmask. > > I can confirm that this patch fixes the test case I have, which is based on > > J. David's reproducer. > > May I add: > > Tested-by: Rick Macklem <rick.macklem@xxxxxxxxx> > > ? Sure, if you'd like. I have now tested both versions of the patch. I suppose I'm more well known as rmacklem@xxxxxxxxxxx but they both work. > > > > I also think the analysis sounds right. I had gotten to the point where > > I thought it was some oddball case related to xdr_reserve_space() and > > saw that the pointer used by GETATTR's nfsd4_encode_bitmap4() was > > 4 bytes into a page, for the broken case. > > May I also add: > > Reviewed-by: Rick Macklem <rick.macklem@xxxxxxxxx> > > ? Both patches look fine to me, although I do not understand the code in __write_bytes_to_xdr_buf(), so I'm not sure I'm the guy to review this stuff. I do understand that a xdr_reserve_space(xdr, 4) is safe, since it is impossible for it to straddle a page. I'll leave it up to you. > > > > (As an aside, it appears that "%p" is broken for 32bit arches. I needed to > > use "%x" with a (unsigned int) cast to printk() pointers. I doubt anyone much > > cares about 32bits any more, but I might look to see if I can fix it.) > > On Linux, the %p formatter emits a hash instead of the actual pointer > address, for security reasons. There are ways to disable this -- see > the "no_hash_pointers" kernel command line argument for the big hammer. > > (Documentation/admin-guide/kernel-parameters.txt) Thanks for the info. The "pointers" did throw me off for a bit;-) Thanks, rick > > > > Good work Chuck! > > Thanks! > > > >> The patch description of commit 2825a7f90753 ("nfsd4: allow encoding > >> across page boundaries") [2014] remarks that NFSD "can't handle a > >> new operation starting close to the end of a page." This behavior > >> appears to be one reason for that remark. > >> > >> Break up the reservation of the COMPOUND op num and op status data > >> items into two distinct 4-octet reservations. Thanks to XDR data > >> item alignment restrictions, a 4-octet buffer reservation can never > >> straddle a page boundary. > > I don't know enough about the code to comment w.r.t. whether or not this > > is a correct fix, although it seems to work for the test case. > > The most correct fix, IMO, would be to use write_bytes_to_xdr_buf() > to insert the op status once the operation has been encoded. That > API does not depend on an ephemeral C pointer into a buffer. > > The most performant fix is the one proposed here, and this one is > also likely to apply cleanly to older Linux kernels. > > > > I will note that it is a pretty subtle trap and it would be nice if something > > could be done to avoid this coding mistake from happening again. > > All I can think of is defining a new function that must be used instead of > > xdr_reserve_space() if there will be other encoding calls done before the > > pointer is used. Something like xdr_reserve_u32_long_term(). (Chuck is > > much better at naming stuff than I am.) > > --> If your fix is correct, in general, this function would just call > > xdr_reserve_space(xdr, XDR_UNIT), > > but it could also set a flag in the xdr structure so that it can > > only be done once until > > the flag is cleared (by a function call when the code is done > > with the pointer), or something like that? > > Well there's no limit on the number of times you can call > xdr_reserve_space(XDR_UNIT) and save the result. > > It just so happens that, with the current implementation of > xdr_reserve_space(), reserving up to 4 octets returns a pointer that > remains valid as long as the buffer exists. For larger sizes, that > doesn't happen to work -- the returned pointer is guaranteed to be valid > only until the next call to xdr_reserve_space() or xdr_commit_encode(). > > The only reason to use the "save the pointer" trick is because it is > more performant than write_bytes_to_xdr_buf() (go look at what that API > does to see why). > > So the whole "save a pointer into the XDR buf, use it later" trick is > brittle and depends on an undocumented behavior of that API. At this > point, the least we can do is add a warning to reserve_space's kdoc > comment. The best we could do is come up with an API that performs > reasonably well and makes this common trope less brittle. > > I'm auditing the code base for other places that might make unsafe > assumptions about the pointer returned from xdr_reserve_space(). > nfsd4_encode_read() and read_plus() both do. Probably the SunRPC GSS-API > encoders do as well. > > > > Anyhow, others probably can suggest better changes that would avoid this trap? > > > > Good work, rick > > > >> > >> Reported-by: J David <j.david.lists@xxxxxxxxx> > >> Cc: stable@xxxxxxxxxxxxxxx > >> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx> > >> --- > >> fs/nfsd/nfs4xdr.c | 5 +++-- > >> 1 file changed, 3 insertions(+), 2 deletions(-) > >> > >> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c > >> index 53fac037611c..8780da884197 100644 > >> --- a/fs/nfsd/nfs4xdr.c > >> +++ b/fs/nfsd/nfs4xdr.c > >> @@ -5764,10 +5764,11 @@ nfsd4_encode_operation(struct nfsd4_compoundres *resp, struct nfsd4_op *op) > >> nfsd4_enc encoder; > >> __be32 *p; > >> > >> - p = xdr_reserve_space(xdr, 8); > >> + if (xdr_stream_encode_u32(xdr, op->opnum) != XDR_UNIT) > >> + goto release; > >> + p = xdr_reserve_space(xdr, XDR_UNIT); > >> if (!p) > >> goto release; > >> - *p++ = cpu_to_be32(op->opnum); > >> post_err_offset = xdr->buf->len; > >> > >> if (op->opnum == OP_ILLEGAL) > >> -- > >> 2.47.0 > >> > > > -- > Chuck Lever