Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

I just realized there’s one thing you should know: we run Cisco’s AMP for Endpoints on the server. The goal is to detect malware that our users might put on the file system. Typically one worries about malware installed on clients, but we’re concerned that developers may be using Java and Python libraries with known issues, and those will commonly be stored on the server.

If AMP is doing its job, it will check most new files. I’m not sure whether that creates atypical usage or not.

> On Aug 9, 2021, at 2:56:15 PM, Timothy Pearson <tpearson@xxxxxxxxxxxxxxxxxxxxx> wrote:
> 
> Can confirm -- same general backtrace I sent in earlier.
> 
> That means the bug is:
> 1.) Not architecture specific
> 2.) Not filesystem specific
> 
> I was originally concerned that it was BTRFS- or POWER-specific; good to see it is not.
> 
> ----- Original Message -----
>> From: "hedrick" <hedrick@xxxxxxxxxxx>
>> To: "J. Bruce Fields" <bfields@xxxxxxxxxxxx>
>> Cc: "Timothy Pearson" <tpearson@xxxxxxxxxxxxxxxxxxxxx>, "Chuck Lever" <chuck.lever@xxxxxxxxxx>, "linux-nfs"
>> <linux-nfs@xxxxxxxxxxxxxxx>
>> Sent: Monday, August 9, 2021 1:51:05 PM
>> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
> 
>> I have. I was trying to avoid a reboot.
>> 
>> By the way, after the first failure, during reboot, syslog showed the following.
>> I’m not sure what it means, but it looks like it might be from the failure.
>> 
>> 
>> 
>>> On Aug 9, 2021, at 2:49 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>> 
>>> On Mon, Aug 09, 2021 at 02:38:33PM -0400, hedrick@xxxxxxxxxxx wrote:
>>>> Does setting /proc/sys/fs/leases-enable to 0 work while the system is
>>>> up? I was expecting to see lslocks | grep DELE | wc go down. It’s not.
>>>> It’s staying around 1850.
>>> 
>>> All it should do is prevent giving out *new* delegations.
>>> 
>>> Best is to set that sysctl on system startup before nfsd starts.
>>> 
>>>>> On Aug 9, 2021, at 2:30 PM, Timothy Pearson
>>>>> <tpearson@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>> 
>>>>> FWIW that's *exactly* what we see.  Eventually, if the server is
>>>>> left alone for enough time, even the login system stops responding
>>>>> -- it's as if the I/O subsystem degrades and eventually blocks
>>>>> entirely.
>>> 
>>> That's pretty common behavior across a variety of kernel bugs.  So on
>>> its own it doesn't mean the root cause is the same.
>>> 
>>> --b.
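
Bruce’s advice (set fs.leases-enable before nfsd starts) and the `lslocks | grep DELE | wc` check can be sketched in shell as below. This is a minimal sketch, not a tested procedure: it assumes a systemd-based distribution where systemd-sysctl.service applies /etc/sysctl.d/*.conf early in boot, before nfs-server.service starts nfsd. The file name `90-nfsd-no-leases.conf` and the function names are hypothetical. The counter reads /proc/locks format directly, where the kernel prints the lock type as `DELEG` for NFS delegations.

```shell
#!/bin/sh
# Sketch only. Assumption: systemd-sysctl.service applies
# /etc/sysctl.d/*.conf before nfs-server.service starts nfsd.
# The file name and function names below are hypothetical.

# Persist fs.leases-enable = 0 so it is already in effect when nfsd
# starts on the next boot. Must be run as root.
disable_leases_at_boot() {
    echo 'fs.leases-enable = 0' > /etc/sysctl.d/90-nfsd-no-leases.conf
}

# Count delegations in /proc/locks-format input. The second field is
# the lock type; blocked waiters show "->" there and are not counted.
count_delegs() {
    awk '$2 == "DELEG" { n++ } END { print n + 0 }'
}

# Example: count_delegs < /proc/locks
```

Setting the sysctl on a running server (e.g. `sysctl -w fs.leases-enable=0`) only stops *new* delegations from being handed out, so a count taken this way is expected to stay flat rather than drop, which matches the ~1850 that stayed around 1850.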




