Re: [RFC PATCH] NFS: Fix missing files in `ls` command output

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Tue, 3 Sep 2024 13:48:17 +0000

> On Sep 2, 2024, at 2:27 PM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
>> On Sep 2, 2024, at 7:46 AM, Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>> 
>> On Fri, Aug 30, 2024 at 1:57 AM Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>> 
>>> On 29 Aug 2024, at 8:54, Yafang Shao wrote:
>>> 
>>>> On Thu, Aug 29, 2024 at 8:44 PM Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>>>> 
>>>>> On 29 Aug 2024, at 5:13, Yafang Shao wrote:
>>>>> 
>>>>>> In our production environment, we noticed that some files are missing when
>>>>>> running the ls command in an NFS directory. However, we can still
>>>>>> successfully cd into the missing directories. This issue can be illustrated
>>>>>> as follows:
>>>>>> 
>>>>>> $ cd nfs
>>>>>> $ ls
>>>>>> a b c e f            <<<< 'd' is missing
>>>>>> $ cd d               <<<< success
>>>>>> 
>>>>>> I verified the issue with the latest upstream kernel, and it still
>>>>>> persists. Further analysis reveals that files go missing when the dtsize is
>>>>>> expanded. The default dtsize was reduced from 1MB to 4KB in commit
>>>>>> 580f236737d1 ("NFS: Adjust the amount of readahead performed by NFS readdir").
>>>>>> After restoring the default size to 1MB, the issue disappears. I also tried
>>>>>> setting the default size to 8KB, and the issue similarly disappears.
>>>>>> 
>>>>>> Upon further analysis, it appears that there is a bad entry being decoded
>>>>>> in nfs_readdir_entry_decode(). When a bad entry is encountered, the
>>>>>> decoding process breaks without handling the error. We should revert the
>>>>>> bad entry in such cases. After implementing this change, the issue is
>>>>>> resolved.
>>>>> 
>>>>> It seems like you're trying to handle a server bug of some sort.  Have you
>>>>> been able to look at a wire capture to determine why there's a bad entry?
>>>> 
>>>> I've used tcpdump to analyze the packets but didn't find anything
>>>> suspicious. Do you have any suggestions?
>>> 
>>> I'd check to make sure the server isn't overrunning the READDIR request's
>>> dircount and maxcount (they should be the same for the Linux client).  If
>>> the server isn't exceeding them, then there's likely a client bug.
>>> 
>>> Ben
>>> 
>> 
>> Hello Ben,
>> 
>> After further investigation, we found that the root cause lies in the
>> NFS server: it truncates the directory listing to fit the client's
>> READDIR RPC args->size parameter without adjusting the cookie value
>> to match. After implementing a fix on the server side, the issue has
>> been resolved.
> 
> Please post your server fix on this mailing list. Thanks!

I was assuming your test server was Linux NFSD. If not,
then please ignore me!
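
For anyone following along, here is a rough sketch of the failure mode
described above, as I understand it. This is a toy model in Python, not
NFSD or client code: the Entry type, the fixed ENTRY_SIZE, the
server_batch knob, and the readdir()/list_dir() helpers are all
hypothetical and exist only to illustrate how truncating a reply without
adjusting the resume cookie can drop an entry.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    cookie: int  # opaque resume position: "continue after this entry"


# Hypothetical directory; cookies are simply 1..N in this model.
DIRECTORY = [Entry(n, i + 1) for i, n in enumerate("abcdef")]

ENTRY_SIZE = 1024  # assumed fixed encoded size per entry (illustration only)


def readdir(cookie, maxcount, server_batch, adjust_cookie):
    """Toy READDIR: return (names, resume_cookie, eof)."""
    # The server gathers a batch using its own internal size ...
    batch = [e for e in DIRECTORY if e.cookie > cookie][:server_batch]
    if not batch:
        return [], cookie, True
    # ... then truncates it to what fits in the client's buffer.
    fit = max(1, maxcount // ENTRY_SIZE)
    sent = batch[:fit]
    # Correct: resume after the last entry actually sent.
    # Buggy: resume after the last entry gathered, skipping the cut tail.
    last = sent[-1] if adjust_cookie else batch[-1]
    eof = last.cookie == DIRECTORY[-1].cookie
    return [e.name for e in sent], last.cookie, eof


def list_dir(maxcount, server_batch, adjust_cookie):
    """Toy client: keep calling readdir() until the server reports EOF."""
    names, cookie, eof = [], 0, False
    while not eof:
        chunk, cookie, eof = readdir(cookie, maxcount, server_batch,
                                     adjust_cookie)
        names.extend(chunk)
    return names


print(" ".join(list_dir(3 * 1024, 4, adjust_cookie=False)))  # a b c e f
print(" ".join(list_dir(3 * 1024, 4, adjust_cookie=True)))   # a b c d e f
```

In this model a client buffer at least as large as the server's batch
(for example maxcount = 4 * 1024) hides the problem, since nothing gets
truncated. Whether that is what happened with the larger dtsize values
in the original report is only a guess on my part.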

--
Chuck Lever