On Thu, Oct 23, 2008 at 1:00 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> On Thu, 23 Oct 2008 11:42:49 -0500
> "Steve French" <smfrench@xxxxxxxxx> wrote:
>
>> On Thu, Oct 23, 2008 at 11:26 AM, Shirish Pargaonkar
>> <shirishpargaonkar@xxxxxxxxx> wrote:
>> > On Thu, Oct 23, 2008 at 9:54 AM, Steve French <smfrench@xxxxxxxxx> wrote:
>> >>> The big question is what happens under heavy memory pressure. In that
>> >>> case we may not be able to copy the data from the buffer or iov into
>> >>> the skbuf. With non-blocking I/O, kernel_sendmsg will return an error
>> >>> if nothing was copied (probably -EAGAIN or -ENOSPC). If some of the
>> >>> data could be copied, then it'll return a positive value less than
>> >>> the size of the send.
>>
>> I think we are ok, as long as in all paths we either:
>> a) resend the remainder of the SMB (by retrying what is left in the
>> buffer), or
>> b) kill the tcp session, reconnect (and either retry with a new handle
>> or return an EHOSTDOWN error to the vfs, depending on the path)
>>
>> >>> Regardless of whether we use blocking I/O or not, kernel_sendmsg could
>> >>> still copy less data than was requested. The patch needs to account
>> >>> for this, probably by retrying the send with the remaining data if it
>> >>> occurs.
>> >> Yes - if that is true we need to retry at least once on this.
>> >>
>> >>> I don't think Shirish's patch is likely fixing this problem, but is
>> >>> probably just making it harder for it to occur.
>> >>>
>> >>> I think, overall, that blocking I/O makes sense. It should make error
>> >>> handling easier, and I don't see any real disadvantage. We might
>> >>> consider using longer timeouts, removing the MSG_NOSIGNAL flag, and
>> >>> changing the code to deal with the possibility of catching a signal
>> >>> during the kernel_sendmsg call.
>> >>>
>> >>> Also, we're setting the sk_sndbuf and sk_rcvbuf in ipv4_connect.
>> >>> According to Neil, that may be hurting us here. This has the effect of
>> >>> placing a hard limit on the send buffer for the socket, and may be
>> >>> making it more likely that we get an error back from kernel_sendmsg.
>> >> Yes - Sridhar and Dave Stevens mentioned that too. We haven't needed
>> >> that (setting sndbuf) for at least a few kernel versions (is the auto
>> >> socket buffer size tuning working as far back as RHEL5/2.6.18 and
>> >> SLES10/2.6.16 too?)
>> >>
>> >> Shirish,
>> >> I thought that that was part of your patch?
>> >
>> > No, it is not, but this is an easy change: just do not set them.
>>
>> As I look into this sndbuf and rcvbuf size setting ... what concerns
>> me is why nfs still sets the snd and rcvbuf sizes if they don't need
>> to be set. We (cifs) have larger write sizes (56K) than nfs's
>> default. See svc_set_sockbufsize in net/sunrpc/svcsock.c
>>
>
> AFAICT, those size settings have been there for a long time (2002?). My
> guess is that this may predate the socket autotuning stuff that Neil H.
> mentioned to me.
>
> His statement was basically "unless you know for sure that you don't
> want to use more than X amount of memory, then there isn't much reason
> to set the send and receive buffers".
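On the partial-send point: something like the following is roughly
what I have in mind for (a). This is an untested sketch, not Shirish's
patch - the function name, retry count, and sleep are made up - but it
shows the idea: advance the iovec past whatever kernel_sendmsg managed
to copy and retry, and only give up (so the caller can fall back to
(b) and reconnect the tcp session) after a few tries:

#include <linux/net.h>
#include <linux/delay.h>
#include <linux/uio.h>

/* illustrative only - loosely modeled on smb_send in fs/cifs/transport.c */
static int smb_send_full(struct socket *ssocket, struct kvec *iov,
			 unsigned int n_vec, size_t total)
{
	struct msghdr msg = { .msg_flags = MSG_NOSIGNAL };
	size_t sent = 0;
	int retries = 0;
	int rc;

	while (sent < total) {
		rc = kernel_sendmsg(ssocket, &msg, iov, n_vec, total - sent);
		if (rc == -EAGAIN || rc == -ENOSPC) {
			if (++retries > 2)
				return rc;	/* caller kills the tcp session */
			msleep(125);		/* wait for memory/sndbuf space */
			continue;
		}
		if (rc < 0)
			return rc;		/* hard error - reconnect path */

		sent += rc;
		/* short copy: advance the iovec past the bytes that went out */
		while (rc > 0 && n_vec > 0) {
			if (rc >= iov->iov_len) {
				rc -= iov->iov_len;
				iov++;
				n_vec--;
			} else {
				iov->iov_base = (char *)iov->iov_base + rc;
				iov->iov_len -= rc;
				rc = 0;
			}
		}
	}
	return 0;
}

If we switch to blocking I/O, the -EAGAIN leg mostly goes away, but the
short-copy leg is still needed, since kernel_sendmsg can copy less than
the full size either way.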
I think that there is a problem still - cifs needs tcp autotuning, but
the buffers probably need to be at least as big as the largest SMB
response (about 17K on receives, but configurable to be larger). See
the comment below about NFS:

On Thu, Oct 23, 2008 at 2:05 PM, Jim Rees <rees@xxxxxxxxx> wrote:
> There are two issues to be aware of. One is that the socket buffers
> have to be big enough for the tcp congestion window. In the old days,
> the application would have to know ahead of time how big this is, and
> call setsockopt(), which sets these numbers.
>
> Now, however, the tcp stack "autotunes" the buffer sizes to the
> correct amount. If you call setsockopt() to set a buffer size, or set
> sk_*buf, or set the SOCK_*BUF_LOCK bits in sk_userlocks, you defeat
> this autotuning. This is almost always a bad idea.
>
> The other issue is that, at least for NFS, the receive buffer must be
> big enough to hold the biggest possible rpc. If not, a partial rpc
> will get stuck in the buffer and no progress will be made.
>
> Issue one is easy to deal with: just don't muck with the socket's
> internal data structures. The second one is harder. What's really
> needed is a new api into the tcp layer that will reserve a minimum
> amount of memory in the socket buffer so that receives won't stall.
> For now, our patch sets the sk_*buf values without setting the lock
> flags.
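Jim's last paragraph suggests an interim fix for us as well. Something
like this (again an untested sketch - the function name is made up, and
the real minimum would be tied to the negotiated SMB buffer sizes) in
ipv4_connect, instead of what we do today: raise sk_sndbuf/sk_rcvbuf
directly as a floor, but don't go through setsockopt() (which sets the
SOCK_*BUF_LOCK bits) and don't touch sk_userlocks, so the autotuning is
still free to grow the buffers:

#include <net/sock.h>

/* illustrative only - a floor for the buffer sizes, not a ceiling */
static void cifs_set_sockbuf_min(struct socket *sock, int bufmin)
{
	struct sock *sk = sock->sk;

	lock_sock(sk);
	if (sk->sk_sndbuf < bufmin)
		sk->sk_sndbuf = bufmin;
	if (sk->sk_rcvbuf < bufmin)
		sk->sk_rcvbuf = bufmin;
	/* deliberately leave sk->sk_userlocks alone so autotuning
	   can still grow the buffers later */
	release_sock(sk);
}

That still leaves Jim's second issue (no api to reserve a minimum so a
large receive can't stall), but at least we would stop capping the
window.

-- 
Thanks,

Steve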