Dear Bob,

thanks a lot for your input.

> >>>> That was our first idea too, but we haven't found any indication that
> >>>> this is the case. The xfs file systems seem perfectly fine when all
> >>>> nfsds are in D state, and we can read from them and write to them.
> >>>> If xfs were to block nfs IO, this should affect other processes too,
> >>>> right?
> >>> It's possible that the NFSD threads are waiting on I/O to a particular
> >>> filesystem block. XFS is not likely to block other activity in this
> >>> case.
> >> ok good to know. So far we were under the impression that a file system
> >> would block as a whole.
> >
> > XFS tries to operate in parallel as much as it can. Maybe other
> > filesystems aren't as capable.
> >
> > If the unresponsive block is part of a superblock or the journal (i.e.,
> > shared metadata) I would expect XFS to become unresponsive. For I/O on
> > blocks containing file data, it is likely to have more robust behavior.
> >
>
> Pretty sure we have seen a similar issue - never fully explained. From
> what I recall, the server gets to a low memory state. At that point,
> efforts to coalesce writes are abandoned, and each write request is
> processed in line - vs scheduled - all nfsd's then pile up in D. Writes
> continue to arrive at a rate higher than can keep up. But, the back end
> store (a high end netapp raid 6 w/240 drives also with xfs) had very
> little load - not too busy. Never fully explained it - but Chuck's point
> on shared metadata blocks may be a good place to look - and whether
> in-line write at low memory could have synergy. IIRC, worked around with
> releases and tunables like minfree kmem et al., that came into play to
> reduce - but not eliminate. I'm away from reference material for a while
> but I'll review and update if I find anything.

We'll certainly investigate this topic, but right now it's kinda hard to imagine since I've never seen the file server use more than ~10G of its 64G of RAM (excluding page cache of course).
We're not even sure heavy writes trigger the problem; in one case our monitoring hinted at a lot of reads leading up to the freeze. OTOH if our issue could be resolved by throwing a bunch of RAM modules into the server, all the better.

thanks,
-Christian
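
P.S. To check the low-memory theory against what actually happens at the next freeze, something like the following could record nfsd thread states next to a few memory counters. This is only a rough sketch and rests on several assumptions: a Linux server with /proc mounted, nfsd kernel threads showing up with the command name "nfsd", and MemFree, Dirty and Writeback from /proc/meminfo being the interesting counters (that selection is a guess, not something established in this thread). The function names are made up for illustration.

```python
import os
from collections import Counter


def parse_stat(text):
    """Parse the start of a /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first '(' and the last ')'.
    """
    lpar = text.index("(")
    rpar = text.rindex(")")
    pid = int(text[:lpar])
    comm = text[lpar + 1:rpar]
    state = text[rpar + 1:].split()[0]
    return pid, comm, state


def parse_meminfo(text):
    """Parse /proc/meminfo content into {name: value}; most values are in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def nfsd_states(proc="/proc"):
    """Count nfsd threads per scheduler state (e.g. 'D', 'S', 'I')."""
    counts = Counter()
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                _, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # the task exited while we were scanning
        if comm == "nfsd":  # assumption: kernel nfsd threads use this name
            counts[state] += 1
    return counts


def snapshot(proc="/proc"):
    """Return (nfsd state counts, selected meminfo values) for one sample."""
    with open(os.path.join(proc, "meminfo")) as f:
        mem = parse_meminfo(f.read())
    keys = ("MemFree", "Dirty", "Writeback")  # assumed counters of interest
    return nfsd_states(proc), {k: mem.get(k) for k in keys}
```

Calling snapshot() every few seconds from the existing monitoring and logging the result with a timestamp should show whether the nfsds go into D while free memory is low or dirty pages are piling up, or whether memory looks unremarkable at that point.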