Re: [RFC PATCH] NFS: Fix missing files in `ls` command output

Yafang Shao <laoar.shao@xxxxxxxxx> · Sun, 1 Sep 2024 13:52:39 +0800

On Fri, Aug 30, 2024 at 1:57 AM Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>
> On 29 Aug 2024, at 8:54, Yafang Shao wrote:
>
> > On Thu, Aug 29, 2024 at 8:44 PM Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
> >>
> >> On 29 Aug 2024, at 5:13, Yafang Shao wrote:
> >>
> >>> In our production environment, we noticed that some files are missing when
> >>> running the ls command in an NFS directory. However, we can still
> >>> successfully cd into the missing directories. This issue can be illustrated
> >>> as follows:
> >>>
> >>>   $ cd nfs
> >>>   $ ls
> >>>   a b c e f            <<<< 'd' is missing
> >>>   $ cd d               <<<< success
> >>>
> >>> I verified the issue with the latest upstream kernel, and it still
> >>> persists. Further analysis reveals that files go missing when the dtsize is
> >>> expanded. The default dtsize was reduced from 1MB to 4KB in commit
> >>> 580f236737d1 ("NFS: Adjust the amount of readahead performed by NFS readdir").
> >>> After restoring the default size to 1MB, the issue disappears. I also tried
> >>> setting the default size to 8KB, and the issue similarly disappears.
> >>>
> >>> Upon further analysis, it appears that there is a bad entry being decoded
> >>> in nfs_readdir_entry_decode(). When a bad entry is encountered, the
> >>> decoding process breaks without handling the error. We should revert the
> >>> bad entry in such cases. After implementing this change, the issue is
> >>> resolved.
> >>
> >> It seems like you're trying to handle a server bug of some sort.  Have you
> >> been able to look at a wire capture to determine why there's a bad entry?
> >
> > I've used tcpdump to analyze the packets but didn't find anything
> > suspicious. Do you have any suggestions?
>
> I'd check to make sure the server isn't overrunning the READDIR request's
> dircount and maxcount (they should be the same for the linux client).  If
> the server isn't exceeding them, then there's a likely client bug.

Thank you for the suggestion. I have captured and analyzed the NFS RPC
traffic using Wireshark. I noticed that the ls command is being split
into two NFS READDIR operations. In the first READDIR request, both
the dircount and maxcount parameters are set to 4008. In the
subsequent READDIR request, both dircount and maxcount are set to
8192.

Interestingly, when I increase ctx->dtsize to 8192, the ls command
generates only a single NFS READDIR RPC call, with both dircount and
maxcount set to 8104. The issue disappears in this case as well.
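
For anyone who wants to repeat the overrun check suggested above, here is
a minimal sketch. It assumes the maxcount of each READDIR request and the
XDR-encoded size of the matching READDIR result have been read out of the
capture by hand. The helper name find_overruns and the reply sizes in the
sample are hypothetical placeholders, not values from my trace; only the
maxcount values 4008 and 8192 come from the capture described here.

```python
from typing import Iterable, List, Tuple


def find_overruns(pairs: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return the (maxcount, reply_bytes) pairs where the reply is larger
    than the maxcount the client asked for.

    find_overruns is a hypothetical helper, not part of any NFS tooling.
    """
    return [
        (maxcount, reply_bytes)
        for maxcount, reply_bytes in pairs
        if reply_bytes > maxcount
    ]


# maxcount values (4008, 8192) are the ones seen in the two READDIR requests.
# The reply sizes (3900, 8300) are made-up placeholders for illustration only.
sample = [(4008, 3900), (8192, 8300)]

for maxcount, reply_bytes in find_overruns(sample):
    print(f"reply of {reply_bytes} bytes exceeds maxcount {maxcount}")
```

An empty result would mean the server stayed within the requested limits
for every pair checked, which would point toward the client side.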

--
Regards
Yafang