Re: [RFC PATCH] NFSD: Encode COMPOUND operation status on page boundaries

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Dec 23, 2024 at 6:31 AM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
> On 12/22/24 8:25 PM, Rick Macklem wrote:
> > On Sun, Dec 22, 2024 at 4:10 PM <cel@xxxxxxxxxx> wrote:
> >>
> >> From: Chuck Lever <chuck.lever@xxxxxxxxxx>
> >>
> >> J. David reports an odd corruption of a READDIR reply sent to a
> >> FreeBSD client.
> >>
> >> xdr_reserve_space() has to do a special trick when the @nbytes value
> >> requests more space than there is in the current page of the XDR
> >> buffer.
> >>
> >> In that case, xdr_reserve_space() returns a pointer to the start of
> >> the next page, and then the next call to xdr_reserve_space() invokes
> >> __xdr_commit_encode() to copy enough of the data item back into the
> >> previous page to make that data item contiguous across the page
> >> boundary.
> >>
> >> But we need to be careful in the case where buffer space is reserved
> >> early for a data item that will be inserted into the buffer later.
> >>
> >> One such caller, nfsd4_encode_operation(), reserves 8 bytes in the
> >> encoding buffer for each COMPOUND operation. However, a READDIR
> >> result can sometimes encode file names so that there are only 4
> >> bytes left at the end of the current XDR buffer page (though plenty
> >> of pages are left to handle the remaining encoding tasks).
> >>
> >> If a COMPOUND operation follows the READDIR result (say, a GETATTR),
> >> then nfsd4_encode_operation() will reserve 8 bytes for the op number
> >> (9) and the op status (usually NFS4_OK). In this weird case,
> >> xdr_reserve_space() returns a pointer to byte zero of the next buffer
> >> page, as it assumes the data item will be copied back into place (in
> >> the previous page) on the next call to xdr_reserve_space().
> >>
> >> nfsd4_encode_operation() writes the op num into the buffer, then
> >> saves the next 4-byte location for the op's status code. The next
> >> xdr_reserve_space() call is part of GETATTR encoding, so the op num
> >> gets copied back into the previous page, but the saved location for
> >> the op status continues to point to the wrong spot in the current
> >> XDR buffer page because __xdr_commit_encode() moved that data item.
> >>
> >> After GETATTR encoding is complete, nfsd4_encode_operation() writes
> >> the op status over the first XDR data item in the GETATTR result.
> >> The NFS4_OK status code (0) makes it look like there are zero items
> >> in the GETATTR's attribute bitmask.
> > I can confirm that this patch fixes the test case I have, which is based on
> > J. David's reproducer.
>
> May I add:
>
> Tested-by: Rick Macklem <rick.macklem@xxxxxxxxx>
>
> ?
Sure, if you'd like. I have now tested both versions of the patch.
I suppose I'm more well known as rmacklem@xxxxxxxxxxx but they both work.

>
>
> > I also think the analysis sounds right. I had gotten to the point where
> > I thought it was some oddball case related to xdr_reserve_space() and
> > saw that the pointer used by GETATTR's nfsd4_encode_bitmap4() was
> > 4 bytes into a page, for the broken case.
>
> May I also add:
>
> Reviewed-by: Rick Macklem <rick.macklem@xxxxxxxxx>
>
> ?
Both patches look fine to me, although I do not understand the code
in __write_bytes_to_xdr_buf(), so I'm not sure I'm the guy to review this stuff.
I do understand that a xdr_reserve_space(xdr, 4) is safe, since it is impossible
for it to straddle a page.

I'll leave it up to you.

>
>
> > (As an aside, it appears that "%p" is broken for 32bit arches. I needed to
> > use "%x" with a (unsigned int) cast to printk() pointers. I doubt anyone  much
> > cares about 32bits any more, but I might look to see if I can fix it.)
>
> On Linux, the %p formatter emits a hash instead of the actual pointer
> address, for security reasons. There are ways to disable this -- see
> the "no_hash_pointers" kernel command line argument for the big hammer.
>
> (Documentation/admin-guide/kernel-parameters.txt)
Thanks for the info. The "pointers" did throw me off for a bit;-)

Thanks, rick

>
>
> > Good work Chuck!
>
> Thanks!
>
>
> >> The patch description of commit 2825a7f90753 ("nfsd4: allow encoding
> >> across page boundaries") [2014] remarks that NFSD "can't handle a
> >> new operation starting close to the end of a page." This behavior
> >> appears to be one reason for that remark.
> >>
> >> Break up the reservation of the COMPOUND op num and op status data
> >> items into two distinct 4-octet reservations. Thanks to XDR data
> >> item alignment restrictions, a 4-octet buffer reservation can never
> >> straddle a page boundary.
> > I don't know enough about the code to comment w.r.t. whether or not this
> > is a correct fix, although it seems to work for the test case.
>
> The most correct fix, IMO, would be to use write_bytes_to_xdr_buf()
> to insert the op status once the operation has been encoded. That
> API does not depend on an ephemeral C pointer into a buffer.
>
> The most performant fix is the one proposed here, and this one is
> also likely to apply cleanly to older Linux kernels.
>
>
> > I will note that it is a pretty subtle trap and it would be nice if something
> > could be done to avoid this coding mistake from happening again.
> > All I can think of is defining a new function that must be used instead of
> > xdr_reserve_space() if there will be other encoding calls done before the
> > pointer is used. Something like xdr_reserve_u32_long_term(). (Chuck is
> > much better at naming stuff than I am.)
> > --> If your fix is correct, in general, this function would just call
> > xdr_reserve_space(xdr, XDR_UNIT),
> >        but it could also set a flag in the xdr structure so that it can
> > only be done once until
> >        the flag is cleared (by a function call when the code is done
> > with the pointer), or something like that?
>
> Well there's no limit on the number of times you can call
> xdr_reserve_space(XDR_UNIT) and save the result.
>
> It just so happens that, with the current implementation of
> xdr_reserve_space(), reserving up to 4 octets returns a pointer that
> remains valid as long as the buffer exists. For larger sizes, that
> doesn't happen to work -- the returned pointer is guaranteed to be valid
> only until the next call to xdr_reserve_space() or xdr_commit_encode().
>
> The only reason to use the "save the pointer" trick is because it is
> more performant than write_bytes_to_xdr_buf() (go look at what that API
> does to see why).
>
> So the whole "save a pointer into the XDR buf, use it later" trick is
> brittle and depends on an undocumented behavior of that API. At this
> point, the least we can do is add a warning to reserve_space's kdoc
> comment. The best we could do is come up with an API that performs
> reasonably well and makes this common trope less brittle.
>
> I'm auditing the code base for other places that might make unsafe
> assumptions about the pointer returned from xdr_reserve_space().
> nfsd4_encode_read() and read_plus() both do. Probably the SunRPC GSS-API
> encoders do as well.
>
>
> > Anyhow, others probably can suggest better changes that would avoid this trap?
> >
> > Good work, rick
> >
> >>
> >> Reported-by: J David <j.david.lists@xxxxxxxxx>
> >> Cc: stable@xxxxxxxxxxxxxxx
> >> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> >> ---
> >>   fs/nfsd/nfs4xdr.c | 5 +++--
> >>   1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> >> index 53fac037611c..8780da884197 100644
> >> --- a/fs/nfsd/nfs4xdr.c
> >> +++ b/fs/nfsd/nfs4xdr.c
> >> @@ -5764,10 +5764,11 @@ nfsd4_encode_operation(struct nfsd4_compoundres *resp, struct nfsd4_op *op)
> >>          nfsd4_enc encoder;
> >>          __be32 *p;
> >>
> >> -       p = xdr_reserve_space(xdr, 8);
> >> +       if (xdr_stream_encode_u32(xdr, op->opnum) != XDR_UNIT)
> >> +               goto release;
> >> +       p = xdr_reserve_space(xdr, XDR_UNIT);
> >>          if (!p)
> >>                  goto release;
> >> -       *p++ = cpu_to_be32(op->opnum);
> >>          post_err_offset = xdr->buf->len;
> >>
> >>          if (op->opnum == OP_ILLEGAL)
> >> --
> >> 2.47.0
> >>
>
>
> --
> Chuck Lever





[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux