Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Mon, 9 Aug 2021 17:37:19 +0000

> On Aug 9, 2021, at 1:15 PM, hedrick@xxxxxxxxxxx wrote:
> 
> There seems to be a soft lockup message on the console, but that’s all I can find.

Then when you say "server hangs" you mean that the entire NFS server
system deadlocks. It's not just unresponsive on one or more exports.

A soft lockup is reported when a CPU spends too long in kernel code
without scheduling, typically because of a loop or a contended spin
lock in code that is not running in process context.

> I’m currently considering whether it’s best to move to NFS 4.0, which seems not to cause the issue, or 4.2 with delegations disabled. This is the primary server for the department. If it fails, everything fails: VMs become read-only, user jobs fail, etc.
> 
> We ran for a year before this showed up, so I’m pretty sure going to 4.0 will fix it. But I have use cases for ACLs that will only work with 4.2. Since the problem seems to be in the callback mechanism, and as far as I can tell that’s only used for delegations, I assume turning off delegations will fix it.

In NFSv4.1 and later, the callback channel is also used for pNFS. It
can also be used for lock notification in all minor versions.

Disabling delegation can have a performance impact, but it depends on
the nature of your workloads and whether files are shared amongst
your client population.
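
If you do try running with delegations off, it is worth confirming
the setting actually took effect. Here is a minimal sketch, assuming
(and this is an assumption to verify against your kernel's
documentation) that the Linux NFS server builds delegations on file
leases, so that setting fs.leases-enable to 0 before nfsd starts
keeps it from granting any. The function names are hypothetical.

```python
from pathlib import Path

# Assumption: on the Linux NFS server, delegations rely on file leases,
# so fs.leases-enable = 0 (set before nfsd starts) prevents the server
# from handing them out. Check this against your kernel's documentation.
LEASES_SYSCTL = Path("/proc/sys/fs/leases-enable")


def leases_enabled(raw: str) -> bool:
    """Interpret the sysctl file contents: "0" means leases are disabled."""
    return raw.strip() != "0"


def delegations_possible(path: Path = LEASES_SYSCTL) -> bool:
    """Return True if the server could still grant delegations."""
    return leases_enabled(path.read_text())
```

Calling delegations_possible() on the server after a restart should
return False if the sysctl was applied early enough.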

> We’ve also had a history of 4.2 problems on clients. That’s why we backed off to 4.0 initially. Clients were seeing hangs.

Let's stick with the server issue for the moment.

Enabling some tracepoints might give us more insight, though if the
server then crashes we would be hard pressed to examine the trace
records. If it's pretty common to get multiple receive_cb_reply
error messages in a short span of time, you might enable a triggered
tracepoint in that function to start a 60-second tcpdump capture to
a file.
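
If setting up a kernel-side trigger is awkward, a userspace
approximation might do: watch the kernel log for the
receive_cb_reply message and start the capture once a burst is seen.
This is only a rough sketch and not a real triggered tracepoint. The
threshold, window, output path, and the "port 2049" filter are all
assumptions to adjust for your setup.

```python
import subprocess
import time
from collections import deque

PATTERN = "receive_cb_reply"           # text of the kernel error message
BURST_COUNT = 3                        # hypothetical: messages needed to trigger
BURST_WINDOW = 10.0                    # hypothetical: window in seconds
CAPTURE_SECONDS = 60
CAPTURE_FILE = "/var/tmp/nfs-cb.pcap"  # hypothetical output path


class BurstDetector:
    """Report True when `count` hits land within `window` seconds."""

    def __init__(self, count: int, window: float) -> None:
        self.count = count
        self.window = window
        self.stamps: deque = deque()

    def hit(self, now: float) -> bool:
        self.stamps.append(now)
        # Drop timestamps that have aged out of the window.
        while self.stamps and now - self.stamps[0] > self.window:
            self.stamps.popleft()
        if len(self.stamps) >= self.count:
            self.stamps.clear()  # start fresh after triggering
            return True
        return False


def capture(seconds: int = CAPTURE_SECONDS, path: str = CAPTURE_FILE) -> None:
    """Run tcpdump for `seconds`, writing packets to `path` (needs root)."""
    # Assumption: the traffic of interest is on the NFS port, 2049.
    proc = subprocess.Popen(["tcpdump", "-i", "any", "-w", path, "port", "2049"])
    try:
        proc.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        proc.terminate()
        proc.wait()


def watch() -> None:
    """Follow the kernel log and capture whenever a burst is detected."""
    detector = BurstDetector(BURST_COUNT, BURST_WINDOW)
    # Note: "dmesg --follow" replays the existing buffer first, so old
    # messages can cause one early capture right after startup.
    log = subprocess.Popen(["dmesg", "--follow"],
                           stdout=subprocess.PIPE, text=True)
    for line in log.stdout:
        if PATTERN in line and detector.hit(time.monotonic()):
            capture()
```

Running watch() as root on the server would leave a capture file
behind even if the machine later hangs, provided the file system
holding it was synced in time.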

--
Chuck Lever