On Tue, Aug 01, 2017 at 07:49:50PM +0200, Paul Menzel wrote:
> Dear Brian, dear Christoph,
>
>
> On 06/27/17 13:59, Paul Menzel wrote:
>
> >Just a small update that we were hit by the problem on a different
> >machine (identical model) with Linux 4.9.32 and the exact same
> >symptoms.
> >
> >```
> >$ sudo cat /proc/2085/stack
> >[<ffffffff811f920c>] iomap_write_begin+0x8c/0x120
> >[<ffffffff811f982b>] iomap_zero_range_actor+0xeb/0x210
> >[<ffffffff811f9a82>] iomap_apply+0xa2/0x110
> >[<ffffffff811f9c58>] iomap_zero_range+0x58/0x80
> >[<ffffffff8133c7de>] xfs_zero_eof+0x4e/0xb0
> >[<ffffffff8133c9dd>] xfs_file_aio_write_checks+0x19d/0x1c0
> >[<ffffffff8133ce89>] xfs_file_buffered_aio_write+0x79/0x2d0
> >[<ffffffff8133d17e>] xfs_file_write_iter+0x9e/0x150
> >[<ffffffff81198dc0>] do_iter_readv_writev+0xa0/0xf0
> >[<ffffffff81199fba>] do_readv_writev+0x18a/0x230
> >[<ffffffff8119a2ac>] vfs_writev+0x3c/0x50
> >[<ffffffffffffffff>] 0xffffffffffffffff
> >```
> >
> >We haven’t had time to set up a test system yet to analyze that further.
>
> Today, two systems with Linux 4.9.23 exhibited the problem of `top`
> showing that `nfsd` is at 100 %. Restarting one machine into Linux
> *4.9.38* showed the same problem. One of them with a 1 GBit/s
> network device got traffic from a 10 GBit/s system, so the
> connection was saturated.

So the question is this: is there IO being issued here, is the page
cache growing, or is it in a tight loop doing nothing? Details of your
hardware, XFS config and NFS server config are kinda important here,
too.

For example, if the NFS server IO patterns trigger a large speculative
delayed allocation and the client then does a write at the end of the
speculative delalloc range, we will zero the entire speculative
delalloc range. That could be several GB of zeros that need to be
written here. It's sub-optimal, yes, but large zeroing is rare enough
that we haven't needed to optimise it by allocating unwritten extents
instead. It would be really handy to know what application the NFS
client is running, as that might give insight into the trigger
behaviour and whether you are hitting this case.

Also, if the NFS client is only writing to one file, then all the
other writes that are on the wire will end up being serviced by nfsd
threads that then block waiting for the inode lock. If the client
issues more writes on the wire than the NFS server has worker threads,
that client-side write stream will starve the NFS server of worker
threads until the zeroing completes. This is the behaviour you are
seeing - it's a common server-side config error that's been known for
at least 15 years...

FWIW, it used to be that a Linux NFS client could have 16 concurrent
outstanding NFS RPCs to a server at a time - I don't know if that
limit still exists or whether it's been increased. However, the
typical knfsd default is (still) only 8 worker threads, meaning a
single client and server using default configs can cause the above
server DOS issue. e.g. on a bleeding edge Debian distro install:

$ head -2 /etc/default/nfs-kernel-server
# Number of servers to start up
RPCNFSDCOUNT=8
$

So, yeah, distros still only configure the NFS server with 8 worker
threads by default. If it's a dedicated NFS server, then I'd be using
somewhere around 64 NFSD threads *per CPU* as a starting point for the
server config...
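For reference, a rough sketch of how to check and bump that at runtime
(assuming a Debian-style knfsd install like the one above; the sunrpc
sysctl at the end is from memory, and on newer kernels the client slot
table is sized dynamically, so treat it as a hint only):

# Server side: how many knfsd threads are currently running?
$ cat /proc/fs/nfsd/threads

# Bump it to 64 on the fly (rpc.nfsd just writes that proc file)
$ sudo rpc.nfsd 64

# Make it persistent across restarts on Debian/Ubuntu
$ sudo sed -i 's/^RPCNFSDCOUNT=.*/RPCNFSDCOUNT=64/' /etc/default/nfs-kernel-server

# Client side: the concurrent RPC slot limit, if the kernel still
# exposes it as a static sysctl
$ cat /proc/sys/sunrpc/tcp_slot_table_entries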
At minimum, you need to ensure that the NFS server has at least double
the number of server threads as the largest client-side concurrent RPC
count, so that a single client can't DOS the NFS server with a single
blocked write stream.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx