> On Sep 2, 2024, at 7:46 AM, Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> On Fri, Aug 30, 2024 at 1:57 AM Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>
>> On 29 Aug 2024, at 8:54, Yafang Shao wrote:
>>
>>> On Thu, Aug 29, 2024 at 8:44 PM Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>>>
>>>> On 29 Aug 2024, at 5:13, Yafang Shao wrote:
>>>>
>>>>> In our production environment, we noticed that some files are missing when
>>>>> running the ls command in an NFS directory. However, we can still
>>>>> successfully cd into the missing directories. This issue can be illustrated
>>>>> as follows:
>>>>>
>>>>> $ cd nfs
>>>>> $ ls
>>>>> a b c e f <<<< 'd' is missing
>>>>> $ cd d <<<< success
>>>>>
>>>>> I verified the issue with the latest upstream kernel, and it still
>>>>> persists. Further analysis reveals that files go missing when the dtsize is
>>>>> expanded. The default dtsize was reduced from 1MB to 4KB in commit
>>>>> 580f236737d1 ("NFS: Adjust the amount of readahead performed by NFS readdir").
>>>>> After restoring the default size to 1MB, the issue disappears. I also tried
>>>>> setting the default size to 8KB, and the issue similarly disappears.
>>>>>
>>>>> Upon further analysis, it appears that there is a bad entry being decoded
>>>>> in nfs_readdir_entry_decode(). When a bad entry is encountered, the
>>>>> decoding process breaks without handling the error. We should revert the
>>>>> bad entry in such cases. After implementing this change, the issue is
>>>>> resolved.
>>>>
>>>> It seems like you're trying to handle a server bug of some sort. Have you
>>>> been able to look at a wire capture to determine why there's a bad entry?
>>>
>>> I've used tcpdump to analyze the packets but didn't find anything
>>> suspicious. Do you have any suggestions?
>>
>> I'd check to make sure the server isn't overrunning the READDIR request's
>> dircount and maxcount (they should be the same for the linux client). If
>> the server isn't exceeding them, then there's a likely client bug.
>>
>> Ben
>>
>
> Hello Ben,
>
> Upon thorough examination, we have identified the root cause of the
> issue to lie within the NFS server, specifically its behavior of
> truncating file listings to match the client's READDIR RPC args->size
> parameter without appropriately adjusting the cookie value. After
> implementing a fix on the server side, the issue has been resolved.

Please post your server fix on this mailing list. Thanks!

> However, to enhance resilience and mitigate future server-side
> vulnerabilities, it may be prudent to implement client-side handling
> mechanisms for such issues. What do you think?

The general policy we follow is to avoid fixing server bugs via
client-side workarounds. Fix the server in that case.

--
Chuck Lever