I'm not sure that is much different from the load patterns we end up
generating, with mixed remote and local I/O.  I'd think that such a scenario
is fairly typical, especially when factoring in backup processes.

----- Original Message -----
> From: "hedrick" <hedrick@xxxxxxxxxxx>
> To: "Timothy Pearson" <tpearson@xxxxxxxxxxxxxxxxxxxxx>
> Cc: "J. Bruce Fields" <bfields@xxxxxxxxxxxx>, "Chuck Lever" <chuck.lever@xxxxxxxxxx>, "linux-nfs"
> <linux-nfs@xxxxxxxxxxxxxxx>
> Sent: Monday, August 9, 2021 3:54:17 PM
> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> I just realized there’s one thing you should know. We run Cisco’s AMP for
> Endpoints on the server. The goal is to detect malware that our users might put
> on the file system. Typically one is worried about malware installed on clients,
> but we’re concerned that developers may be using Java and Python libraries with
> known issues, and those will commonly be stored on the server.
>
> If AMP is doing its job, it will check most new files. I’m not sure whether that
> creates atypical usage or not.
>
>> On Aug 9, 2021, at 2:56:15 PM, Timothy Pearson <tpearson@xxxxxxxxxxxxxxxxxxxxx>
>> wrote:
>>
>> Can confirm -- same general backtrace I sent in earlier.
>>
>> That means the bug is:
>> 1.) Not architecture specific
>> 2.) Not filesystem specific
>>
>> I was originally concerned it was BTRFS- or POWER-specific; good to
>> see it is not.
>>
>> ----- Original Message -----
>>> From: "hedrick" <hedrick@xxxxxxxxxxx>
>>> To: "J. Bruce Fields" <bfields@xxxxxxxxxxxx>
>>> Cc: "Timothy Pearson" <tpearson@xxxxxxxxxxxxxxxxxxxxx>, "Chuck Lever"
>>> <chuck.lever@xxxxxxxxxx>, "linux-nfs"
>>> <linux-nfs@xxxxxxxxxxxxxxx>
>>> Sent: Monday, August 9, 2021 1:51:05 PM
>>> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>>
>>> I have. I was trying to avoid a reboot.
>>>
>>> By the way, after the first failure, during reboot, syslog showed the following.
>>> I’m unclear what it means, but it looks like it might be from the failure
>>>
>>>
>>>
>>>> On Aug 9, 2021, at 2:49 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>>>
>>>> On Mon, Aug 09, 2021 at 02:38:33PM -0400, hedrick@xxxxxxxxxxx wrote:
>>>>> Does setting /proc/sys/fs/leases-enable to 0 work while the system is
>>>>> up? I was expecting to see lslocks | grep DELE | wc go down. It’s not.
>>>>> It’s staying around 1850.
>>>>
>>>> All it should do is prevent giving out *new* delegations.
>>>>
>>>> Best is to set that sysctl on system startup before nfsd starts.
>>>>
>>>>>> On Aug 9, 2021, at 2:30 PM, Timothy Pearson
>>>>>> <tpearson@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> FWIW that's *exactly* what we see. Eventually, if the server is
>>>>>> left alone for enough time, even the login system stops responding
>>>>>> -- it's as if the I/O subsystem degrades and eventually blocks
>>>>>> entirely.
>>>>
>>>> That's pretty common behavior across a variety of kernel bugs. So on
>>>> its own it doesn't mean the root cause is the same.
>>>>
>>>> --b.
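
For anyone wanting to track the delegation count discussed above over time, here is a minimal sketch. It assumes the kernel lists delegations in /proc/locks with the type field "DELEG" (consistent with the "grep DELE" used in the thread), and that /proc/sys/fs/leases-enable holds "0" or "1". The function names and the sample text are hypothetical, made up for illustration, and are not output from the affected server.

```python
def count_delegations(proc_locks_text: str) -> int:
    """Count DELEG entries in text formatted like /proc/locks."""
    count = 0
    for line in proc_locks_text.splitlines():
        fields = line.split()
        # Blocked waiters are listed as "N: -> TYPE ..."; drop the arrow
        # so the lock type is always the second field.
        if len(fields) > 1 and fields[1] == "->":
            fields = fields[:1] + fields[2:]
        if len(fields) > 1 and fields[1] == "DELEG":
            count += 1
    return count


def leases_enabled(sysctl_text: str) -> bool:
    """Interpret the contents of /proc/sys/fs/leases-enable."""
    return sysctl_text.strip() != "0"


# Illustrative sample only (assumed layout), not captured from a real host.
SAMPLE = """\
1: DELEG  ACTIVE    READ  1234 00:2d:5678 0 EOF
2: POSIX  ADVISORY  WRITE 4321 08:01:9999 0 EOF
3: DELEG  ACTIVE    READ  1234 00:2d:5679 0 EOF
"""
```

On a live server one could pass in the contents of /proc/locks and /proc/sys/fs/leases-enable. As noted in the thread, setting the sysctl to 0 only stops *new* delegations from being handed out, so the count would not be expected to drop immediately.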