Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

Charles Hedrick <hedrick@xxxxxxxxxxx> · Mon, 9 Aug 2021 22:01:39 +0000

Yes, but the timing may be different. When a new file is created, inotify will tell AMP about it, and AMP will immediately read it.

> On Aug 9, 2021, at 5:49:30 PM, Timothy Pearson <tpearson@xxxxxxxxxxxxxxxxxxxxx> wrote:
> 
> I'm not sure that is much different than the load patterns we end up generating, with mixed remote and local I/O.  I'd think that such a scenario is fairly typical, especially when factoring in backup processes.
> 
> ----- Original Message -----
>> From: "hedrick" <hedrick@xxxxxxxxxxx>
>> To: "Timothy Pearson" <tpearson@xxxxxxxxxxxxxxxxxxxxx>
>> Cc: "J. Bruce Fields" <bfields@xxxxxxxxxxxx>, "Chuck Lever" <chuck.lever@xxxxxxxxxx>, "linux-nfs"
>> <linux-nfs@xxxxxxxxxxxxxxx>
>> Sent: Monday, August 9, 2021 3:54:17 PM
>> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
> 
>> I just realized there’s one thing you should know. We run Cisco’s AMP for
>> Endpoints on the server. The goal is to detect malware that our users might put
>> on the file system. Typically one is worried about malware installed on clients,
>> but we’re concerned that developers may be using java and python libraries with
>> known issues, and those will commonly be stored on the server.
>> 
>> If AMP is doing its job, it will check most new files. I’m not sure whether that
>> creates atypical usage or not.
>> 
>>> On Aug 9, 2021, at 2:56:15 PM, Timothy Pearson <tpearson@xxxxxxxxxxxxxxxxxxxxx>
>>> wrote:
>>> 
>>> Can confirm -- same general backtrace I sent in earlier.
>>> 
>>> That means the bug is:
>>> 1.) Not architecture specific
>>> 2.) Not filesystem specific
>>> 
>>> I was originally concerned it was specific to BTRFS or POWER; good to
>>> see it is not.
>>> 
>>> ----- Original Message -----
>>>> From: "hedrick" <hedrick@xxxxxxxxxxx>
>>>> To: "J. Bruce Fields" <bfields@xxxxxxxxxxxx>
>>>> Cc: "Timothy Pearson" <tpearson@xxxxxxxxxxxxxxxxxxxxx>, "Chuck Lever"
>>>> <chuck.lever@xxxxxxxxxx>, "linux-nfs"
>>>> <linux-nfs@xxxxxxxxxxxxxxx>
>>>> Sent: Monday, August 9, 2021 1:51:05 PM
>>>> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>>> 
>>>> I have. I was trying to avoid a reboot.
>>>> 
>>>> By the way, after the first failure, during reboot, syslog showed the following.
>>>> I’m unclear what it means, but it looks like it might be from the failure.
>>>> 
>>>> 
>>>> 
>>>>> On Aug 9, 2021, at 2:49 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>>>> 
>>>>> On Mon, Aug 09, 2021 at 02:38:33PM -0400, hedrick@xxxxxxxxxxx wrote:
>>>>>> Does setting /proc/sys/fs/leases-enable to 0 work while the system is
>>>>>> up? I was expecting to see lslocks | grep DELE | wc go down. It’s not.
>>>>>> It’s staying around 1850.
>>>>> 
>>>>> All it should do is prevent giving out *new* delegations.
>>>>> 
>>>>> Best is to set that sysctl on system startup before nfsd starts.
>>>>> 
>>>>>>> On Aug 9, 2021, at 2:30 PM, Timothy Pearson
>>>>>>> <tpearson@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> FWIW that's *exactly* what we see.  Eventually, if the server is
>>>>>>> left alone for enough time, even the login system stops responding
>>>>>>> -- it's as if the I/O subsystem degrades and eventually blocks
>>>>>>> entirely.
>>>>> 
>>>>> That's pretty common behavior across a variety of kernel bugs.  So on
>>>>> its own it doesn't mean the root cause is the same.
>>>>> 
>>>>> --b.
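
For the leases-enable exchange quoted above, here is a minimal sketch of the same check as `lslocks | grep DELE | wc`, paired with a read of the sysctl. It is not from anyone in the thread. It assumes that /proc/locks lists delegations with a type field of DELEG and that /proc/sys/fs/leases-enable holds 0 or 1; the function names are made up for illustration. As noted in the thread, a value of 0 should only stop *new* delegations, so the count is not expected to drop right away.

```python
#!/usr/bin/env python3
"""Report held delegations and the fs.leases-enable setting (illustrative sketch)."""
from pathlib import Path

# Assumed locations on a Linux host with /proc mounted.
LOCKS_PATH = Path("/proc/locks")
LEASES_SYSCTL = Path("/proc/sys/fs/leases-enable")


def count_delegations(locks_text: str) -> int:
    """Count lines that have a field starting with "DELE".

    Roughly what `grep DELE | wc -l` does, but matching whole fields
    so that a path containing "DELE" is less likely to be counted.
    """
    count = 0
    for line in locks_text.splitlines():
        if any(field.startswith("DELE") for field in line.split()):
            count += 1
    return count


def leases_enabled(sysctl_text: str) -> bool:
    """Interpret the sysctl contents: "0" means no new leases or delegations."""
    return sysctl_text.strip() != "0"


if __name__ == "__main__":
    try:
        print("delegations held:", count_delegations(LOCKS_PATH.read_text()))
        print("new leases allowed:", leases_enabled(LEASES_SYSCTL.read_text()))
    except OSError as exc:
        # Not Linux, /proc not mounted, or the files are not readable.
        print("could not read lock state:", exc)
```

Running this periodically after setting the sysctl to 0 would show whether the count of held delegations falls over time as clients return them, which is a different question from whether the setting took effect.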