On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> 
> Hi,
>  I just noticed this commit:
> 
> commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> Author: Olga Kornievskaia <aglo@xxxxxxxxxxxxxx>
> Date:   Tue Oct 21 14:13:47 2008 -0400
> 
>     svcrpc: take advantage of tcp autotuning
> 
> 
> which I must confess surprised me.  I wonder if the full implications of
> removing that functionality were understood.
> 
> Previously nfsd would set the transmit buffer space for a connection to
> ensure there is plenty to hold all replies.  Now it doesn't.
> 
> nfsd refuses to accept a request if there isn't enough space in the transmit
> buffer to send a reply.  This is important to ensure that each reply gets
> sent atomically without blocking and there is no risk of replies getting
> interleaved.
> 
> The server starts out with a large estimate of the reply space (1M) and for
> NFSv3 and v2 it quickly adjusts this down to something realistic.  For NFSv4
> it is much harder to estimate the space needed so it just assumes every
> reply will require 1M of space.
> 
> This means that with NFSv4, as soon as you have enough concurrent requests
> such that 1M each reserves all of whatever window size was auto-tuned, new
> requests on that connection will be ignored.
> 
> This could significantly limit the amount of parallelism that can be achieved
> for a single TCP connection (and given that the Linux client strongly prefers
> a single connection now, this could become more of an issue).

Worse, I believe it can deadlock completely if the transmit buffer
shrinks too far, and people really have run into this:

	http://mid.gmane.org/<20130125185748.GC29596@xxxxxxxxxxxx>

Trond's suggestion looked at the time like it might work and be doable:

	http://mid.gmane.org/<4FA345DA4F4AE44899BD2B03EEEC2FA91833C1D8@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx>

but I dropped it.

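
To make the arithmetic concrete, here is a rough toy model of the
reservation scheme described above.  It is only a sketch, not the actual
sunrpc code: the class, the function names and the buffer sizes are
made up for illustration, and it ignores data that is already queued
but not yet sent.  The only number taken from the discussion is the 1M
worst-case reply estimate.

```python
# Toy model of per-connection reply-space reservation.
# Hypothetical names throughout; this is not the kernel's implementation.

MAX_REPLY = 1024 * 1024  # the 1M worst-case reply estimate used for NFSv4


class Connection:
    def __init__(self, sndbuf):
        self.sndbuf = sndbuf   # transmit buffer size picked by autotuning
        self.reserved = 0      # space promised to requests still in progress

    def try_accept(self, estimate=MAX_REPLY):
        """Accept a request only if its whole reply is guaranteed to fit."""
        if self.reserved + estimate > self.sndbuf:
            return False       # request stays unread on the socket
        self.reserved += estimate
        return True

    def reply_sent(self, estimate=MAX_REPLY):
        """Release the reservation once the reply has been queued."""
        self.reserved -= estimate


def max_in_flight(sndbuf, estimate=MAX_REPLY):
    """Upper bound on concurrent requests for a single connection."""
    return sndbuf // estimate


if __name__ == "__main__":
    # A 4M buffer, a 1M buffer, and one that has shrunk below the estimate.
    for sndbuf in (4 * MAX_REPLY, MAX_REPLY, MAX_REPLY // 2):
        print(sndbuf, max_in_flight(sndbuf))
```

In this model the parallelism on one connection is capped at
sndbuf / estimate, and once the buffer is smaller than a single estimate
no request can ever be accepted, which is the deadlock case.
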
The v4-specific situation might not be hard to improve: the v4
processing decodes the whole compound at the start, so it knows the
sequence of ops before it does anything else and could compute a tighter
bound on the reply size at that point.

> I don't know if this is a real issue that needs addressing - I hit in the
> context of a server filesystem which was misbehaving and so caused this issue
> to become obvious.  But in this case it is certainly the filesystem, not the
> NFS server, which is causing the problem.

Yeah, it looks like a real problem.

Some good test cases would be useful if we could find some.

And, yes, my screwup for merging 966043986 without solving those other
problems first.  I was confused.

It does make a difference on high bandwidth-delay-product networks
(something people have also hit).  I'd rather not regress there, and I'd
also rather not require manual tuning for something we should be able to
get right automatically.

--b.