Re: [PATCH net] net/smc: Fix expected buffersizes and sync logic

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Nov 25, 2022 at 11:59:46AM +0100, Alexandra Winter wrote:
> 
> 
> On 25.11.22 08:05, Tony Lu wrote:
> > On Fri, Nov 25, 2022 at 07:15:33AM +0100, Jan Karcher wrote:
> >>
> >>
> >> On 24/11/2022 15:07, Alexandra Winter wrote:
> >>>
> >>>
> >>> On 24.11.22 14:00, Alexandra Winter wrote:
> >>>>
> >>>>
> >> [ ... ]>>>>> On Wed, Nov 23, 2022 at 11:49:07AM +0100, Jan Karcher wrote:
> >>>>>>>> The fixed commit changed the expected behavior of buffersizes
> >>>>>>>> set by the user using the setsockopt mechanism.
> >>>>>>>> Before the fixed patch the logic for determining the buffersizes used
> >>>>>>>> was the following:
> >>>>>>>>
> >>>>>>>> default  = net.ipv4.tcp_{w|r}mem[1]
> >>> Jan, you explained to me: "the minima is 16Kib. This is enforced in smc_compress_bufsize
> >>> which would move any value <= 16Kib into bucket 0 - which is 16KiB "
> >>> net.ipv4.tcp_wmem[1] defaults to 8Kib. So in the default case (unchanged net.ipv4.tcp_wmem[1])
> >>> the default for the send path is not net.ipv4.tcp_wmem[1]. Should be clarified here.
> >>
> >> The default value is still set to the net.ipv4.tcp_{w|r}mem[1]. This is a
> >> *very* top level overview about what is happening and *not* a documentation.
> >> I don't really want to explain the full code flow here.
> >>
> >> What we still should do - as Tony aggreed on - is documenting the SMC
> >> behavior. This is a follow up on my list.
> > 
> > Hello Jan and Alexandra,
> > 
> > It looks like the misalignment of information is causing some trouble,
> > which is introduced by my patch. Maybe we could have an off-maillist and
> > online meeting to discussion?
> > 
> > We have some progress updates of scalability, and we are really like the
> > extension of SMC-D. Also we have some ideas for SMC, in case of
> > misalignment of information, we'd like to put them on the table and
> > discuss them earlier. Maybe an online meeting is an efficient way. What
> > do you think?
> > 
> > If possible, I would prepared the meetings and topics and send them to
> > everyone first.
> > 
> > Cheers,
> > Tony Lu
> > 
> 
> Thanks a lot for your constructive proposals Tony. Yes, we should have a discussion off-mailinglist
> about future topics.

I will prepare the discussion off-maillinglist ASAP. The email will be
sent out when it's ready. And Jan, What about your opinion?

Cheers,
Tony Lu

> 
> My remaining concern for this fix is the default values (user does not use setsockopt, nor 
> changes the new sysfs parameters, nor changes tcp defaults):
> >>>> before 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
> 	    send: 16k recv: 64k
> >>>> after net/smc: Fix expected buffersizes and sync logic   (this patch)
> >>>>       send: 16k recv: 128k
> 
> @Jan, as this is the only patch you want to send to net, please change the default size of
> the receive buffers back to 64k (I don't care how).
> 
> 
> >>
> >>>>>>>> sockopt  = the setsockopt mechanism
> >>>>>>>> val      = the value assigned in default or via setsockopt
> >>>>>>>> sk_buf   = short for sk_{snd|rcv}buf
> >>>>>>>> real_buf = the real size of the buffer (sk_buf_size in __smc_buf_create)
> >>>>>>>>
> >>>>>>>>     exposed   | net/core/sock.c  |    af_smc.c    |  smc_core.c
> >>>>>>>>               |                  |                |
> >>>>>>>> +---------+ |                  | +------------+ | +-------------------+
> >>>>>>>> | default |----------------------| sk_buf=val |---| real_buf=sk_buf/2 |
> >>>>>>>> +---------+ |                  | +------------+ | +-------------------+
> >>>>>>>>               |                  |                |    ^
> >>>>>>>>               |                  |                |    |
> >>>>>>>> +---------+ | +--------------+ |                |    |
> >>>>>>>> | sockopt |---| sk_buf=val*2 |-----------------------|
> >>>>>>>> +---------+ | +--------------+ |                |
> >>>>>>>>               |                  |                |
> >>>>>>>>
> >>>>>>>> The fixed patch introduced a dedicated sysctl for smc
> >>>>>>>> and removed the /2 in smc_core.c resulting in the following flow:
> >>>>>>>>
> >>>>>>>> default  = net.smc.{w|r}mem (which defaults to net.ipv4.tcp_{w|r}mem[1])
> >>>>>>>> sockopt  = the setsockopt mechanism
> >>>>>>>> val      = the value assigned in default or via setsockopt
> >>>>>>>> sk_buf   = short for sk_{snd|rcv}buf
> >>>>>>>> real_buf = the real size of the buffer (sk_buf_size in __smc_buf_create)
> >>>>>>>>
> >>>>>>>>     exposed   | net/core/sock.c  |    af_smc.c    |  smc_core.c
> >>>>>>>>               |                  |                |
> >>>>>>>> +---------+ |                  | +------------+ | +-----------------+
> >>>>>>>> | default |----------------------| sk_buf=val |---| real_buf=sk_buf |
> >>>>>>>> +---------+ |                  | +------------+ | +-----------------+
> >>>>>>>>               |                  |                |    ^
> >>>>>>>>               |                  |                |    |
> >>>>>>>> +---------+ | +--------------+ |                |    |
> >>>>>>>> | sockopt |---| sk_buf=val*2 |-----------------------|
> >>>>>>>> +---------+ | +--------------+ |                |
> >>>>>>>>               |                  |                |
> >>>>>>>>
> >>>>>>>> This would result in double of memory used for existing configurations
> >>>>>>>> that are using setsockopt.
> >>>>>>>
> >>>>>>> Firstly, thanks for your detailed diagrams :-)
> >>>>>>>
> >>>>>>> And the original decision to use user-provided values rather than
> >>>>>>> value/2 to follow the instructions of the socket manual [1].
> >>>>>>>
> >>>>>>>     SO_RCVBUF
> >>>>>>>            Sets or gets the maximum socket receive buffer in bytes.
> >>>>>>>            The kernel doubles this value (to allow space for
> >>>>>>>            bookkeeping overhead) when it is set using setsockopt(2),
> >>>>>>>            and this doubled value is returned by getsockopt(2).  The
> >>>>>>>            default value is set by the
> >>>>>>>            /proc/sys/net/core/rmem_default file, and the maximum
> >>>>>>>            allowed value is set by the /proc/sys/net/core/rmem_max
> >>>>>>>            file.  The minimum (doubled) value for this option is 256.
> >>>>>>>
> >>>>>>> [1] https://man7.org/linux/man-pages/man7/socket.7.html
> >>>>>>>
> >>>>>>> The user of SMC should know that setsockopt() with SO_{RCV|SND}BUF will
> >>>>>>
> >>>>>> I totally agree that an educated user of SMC should know about that behavior
> >>>>>> if they decide to use it.
> >>>>>> We do provide our users preload libraries where they can pass preferred
> >>>>>> buffersizes via arguments and we handle the Sockopts for them.
> >>>>>>
> >>>>>>> double the values in kernel, and getsockopt() will return the doubled
> >>>>>>> values. So that they should use half of the values which are passed to
> >>>>>>> setsockopt(). The original patch tries to make things easier in SMC and
> >>>>>>> let user-space to handle them following the socket manual.
> >>>>>>>
> >>>>>>>> SMC historically decided to use the explicit value given by the user
> >>>>>>>> to allocate the memory. This is why we used the /2 in smc_core.c.
> >>>>>>>> That logic was not applied to the default value.
> >>>>>>>
> >>>>>>> Yep, let back to the patch which introduced smc_{w|r}mem knobs, it's a
> >>>>>>> trade-off to follow original logic of SMC, or follow the socket manual.
> >>>>>>> We decides to follow the instruction of manuals in the end.
> >>>>>>
> >>>>>> I understand the point. I spend a lot of time trying to decide what to do.
> >>>>>>
> >>>>>> Since it was an intentional decision to not follow the general socket
> >>>>>> option, and we do not have anyone complaining we do not really have a reason
> >>>>>> to change it.
> >>>>>> Changing it means that users with existing configurations would have to
> >>>>>> change their configs on an update or suddenly expect double the memory
> >>>>>> consumption.
> >>>>>> That's why we in the end preffered to stay with the current logic.
> >>>>>
> >>>>> I can't agree with you more with the points to follow the historic logic
> >>>>> and not break the user-space applications.
> >>>>>
> >>>>>> I'm thinking that maybe - if we stay with the historic logic - we should
> >>>>>> document that desicion somewhere. So that in the future, if a user that
> >>>>>> expects the man page behavior, has a way to understand what SMC is doing.
> >>>>>> What do oyu think?
> >>>>>
> >>>>> Yep, we _really_ need to document it if we change the convention.
> >>>>> Actually, I spent a lot of time to find the history about the logic of
> >>>>> buffer (/2 and *2) in SMC. So I'm really in favor of adding
> >>>>> documentation, at least code comments to help others to understand them.
> >>>>>
> >>>>> Cheers,
> >>>>> Tony Lu
> >>>> Iiuc you are changing the default values in this a patch and your other patch:
> >>>> Default values for real_buf for send and receive:
> >>>>
> >>>> before 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
> >>>>      real_buf=net.ipv4.tcp_{w|r}mem[1]/2   send: 8k  recv: 64k
> >>>        see above: 			    send: 16k recv: 64k
> >>>> after 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
> >>>> real_buf=net.ipv4.tcp_{w|r}mem[1]   send: 16k (16*1024) recv: 128k (131072)
> >>>>
> >>>> after net/smc: Fix expected buffersizes and sync logic
> >>>> real_buf=net.ipv4.tcp_{w|r}mem[1]   send: 16k (16*1024) recv: 128k (131072)
> >>>>
> >>>> after net/smc: Unbind smc control from tcp control
> >>>> real_buf=SMC_*BUF_INIT_SIZE   send: 16k (16384) recv: 64k (65536)
> >>>>
> >>>> If my understanding is correct, then I nack this.
> >>>> Defaults should be restored to the values before 0227f058aa29.
> >>>> Otherwise users will notice a change in memory usage that needs to
> >>>> be avoided or announced more explicitely. (and don't change them twice)
> >>> See above, I stand corrected. However this patch fixes/restores the buffersize
> >>> for setsockopt, but not for the default recieve path.
> >>> Could you please clarify that in the title and description?
> >>>
> >>
> >> I am trying to keep the commit title as crisp as possible while providing
> >> enough information and set the context in the commit message:
> >>
> >> "The fixed commit changed the expected behavior of buffersizes set by the
> >> user using the setsockopt mechanism."
> >>
> >>  + There is now a whole e-mail thread to consult in case of any further
> >> questions.
> >>
> >> Thank you for your comments
> >> - Jan
> >>
> >>> Reviewed-by: Alexandra Winter <wintera@xxxxxxxxxxxxx>
> >>>>>> - Jan
> >>>>>>
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Tony Lu
> >>>>>>>
> >>>>>>>> Since we now have our own sysctl, which is also exposed to the user,
> >>>>>>>> we should sync the logic in a way that both values are the real value
> >>>>>>>> used by our code and shown by smc_stats. To achieve this this patch
> >>>>>>>> changes the behavior to:
> >>>>>>>>
> >>>>>>>> default  = net.smc.{w|r}mem (which defaults to net.ipv4.tcp_{w|r}mem[1])
> >>>>>>>> sockopt  = the setsockopt mechanism
> >>>>>>>> val      = the value assigned in default or via setsockopt
> >>>>>>>> sk_buf   = short for sk_{snd|rcv}buf
> >>>>>>>> real_buf = the real size of the buffer (sk_buf_size in __smc_buf_create)
> >>>>>>>>
> >>>>>>>>     exposed   | net/core/sock.c  |    af_smc.c     |  smc_core.c
> >>>>>>>>               |                  |                 |
> >>>>>>>> +---------+ |                  | +-------------+ | +-----------------+
> >>>>>>>> | default |----------------------| sk_buf=val*2|---|real_buf=sk_buf/2|
> >>>>>>>> +---------+ |                  | +-------------+ | +-----------------+
> >>>>>>>>               |                  |                 |    ^
> >>>>>>>>               |                  |                 |    |
> >>>>>>>> +---------+ | +--------------+ |                 |    |
> >>>>>>>> | sockopt |---| sk_buf=val*2 |------------------------|
> >>>>>>>> +---------+ | +--------------+ |                 |
> >>>>>>>>               |                  |                 |
> >>>>>>>>
> >>>>>>>> This way both paths follow the same pattern and the expected behavior
> >>>>>>>> is re-established.
> >>>>>>>>
> >>>>>>>> Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
> >>>>>>>> Signed-off-by: Jan Karcher <jaka@xxxxxxxxxxxxx>
> >>>>>>>> Reviewed-by: Wenjia Zhang <wenjia@xxxxxxxxxxxxx>
> >>>>>>>> ---
> >>>>>>>>    net/smc/af_smc.c   | 9 +++++++--
> >>>>>>>>    net/smc/smc_core.c | 8 ++++----
> >>>>>>>>    2 files changed, 11 insertions(+), 6 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> >>>>>>>> index 036532cf39aa..a8c84e7bac99 100644
> >>>>>>>> --- a/net/smc/af_smc.c
> >>>>>>>> +++ b/net/smc/af_smc.c
> >>>>>>>> @@ -366,6 +366,7 @@ static void smc_destruct(struct sock *sk)
> >>>>>>>>    static struct sock *smc_sock_alloc(struct net *net, struct socket *sock,
> >>>>>>>>    				   int protocol)
> >>>>>>>>    {
> >>>>>>>> +	int buffersize_without_overhead;
> >>>>>>>>    	struct smc_sock *smc;
> >>>>>>>>    	struct proto *prot;
> >>>>>>>>    	struct sock *sk;
> >>>>>>>> @@ -379,8 +380,12 @@ static struct sock *smc_sock_alloc(struct net *net, struct socket *sock,
> >>>>>>>>    	sk->sk_state = SMC_INIT;
> >>>>>>>>    	sk->sk_destruct = smc_destruct;
> >>>>>>>>    	sk->sk_protocol = protocol;
> >>>>>>>> -	WRITE_ONCE(sk->sk_sndbuf, READ_ONCE(net->smc.sysctl_wmem));
> >>>>>>>> -	WRITE_ONCE(sk->sk_rcvbuf, READ_ONCE(net->smc.sysctl_rmem));
> >>>>>>>> +	buffersize_without_overhead =
> >>>>>>>> +		min_t(int, READ_ONCE(net->smc.sysctl_wmem), INT_MAX / 2);
> >>>>>>>> +	WRITE_ONCE(sk->sk_sndbuf, buffersize_without_overhead * 2);
> >>>>>>>> +	buffersize_without_overhead =
> >>>>>>>> +		min_t(int, READ_ONCE(net->smc.sysctl_rmem), INT_MAX / 2);
> >>>>>>>> +	WRITE_ONCE(sk->sk_rcvbuf, buffersize_without_overhead * 2);
> >>>>>>>>    	smc = smc_sk(sk);
> >>>>>>>>    	INIT_WORK(&smc->tcp_listen_work, smc_tcp_listen_work);
> >>>>>>>>    	INIT_WORK(&smc->connect_work, smc_connect_work);
> >>>>>>>> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> >>>>>>>> index 00fb352c2765..36850a2ae167 100644
> >>>>>>>> --- a/net/smc/smc_core.c
> >>>>>>>> +++ b/net/smc/smc_core.c
> >>>>>>>> @@ -2314,10 +2314,10 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
> >>>>>>>>    	if (is_rmb)
> >>>>>>>>    		/* use socket recv buffer size (w/o overhead) as start value */
> >>>>>>>> -		sk_buf_size = smc->sk.sk_rcvbuf;
> >>>>>>>> +		sk_buf_size = smc->sk.sk_rcvbuf / 2;
> >>>>>>>>    	else
> >>>>>>>>    		/* use socket send buffer size (w/o overhead) as start value */
> >>>>>>>> -		sk_buf_size = smc->sk.sk_sndbuf;
> >>>>>>>> +		sk_buf_size = smc->sk.sk_sndbuf / 2;
> >>>>>>>>    	for (bufsize_short = smc_compress_bufsize(sk_buf_size, is_smcd, is_rmb);
> >>>>>>>>    	     bufsize_short >= 0; bufsize_short--) {
> >>>>>>>> @@ -2376,7 +2376,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
> >>>>>>>>    	if (is_rmb) {
> >>>>>>>>    		conn->rmb_desc = buf_desc;
> >>>>>>>>    		conn->rmbe_size_short = bufsize_short;
> >>>>>>>> -		smc->sk.sk_rcvbuf = bufsize;
> >>>>>>>> +		smc->sk.sk_rcvbuf = bufsize * 2;
> >>>>>>>>    		atomic_set(&conn->bytes_to_rcv, 0);
> >>>>>>>>    		conn->rmbe_update_limit =
> >>>>>>>>    			smc_rmb_wnd_update_limit(buf_desc->len);
> >>>>>>>> @@ -2384,7 +2384,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
> >>>>>>>>    			smc_ism_set_conn(conn); /* map RMB/smcd_dev to conn */
> >>>>>>>>    	} else {
> >>>>>>>>    		conn->sndbuf_desc = buf_desc;
> >>>>>>>> -		smc->sk.sk_sndbuf = bufsize;
> >>>>>>>> +		smc->sk.sk_sndbuf = bufsize * 2;
> >>>>>>>>    		atomic_set(&conn->sndbuf_space, bufsize);
> >>>>>>>>    	}
> >>>>>>>>    	return 0;
> >>>>>>>> -- 
> >>>>>>>> 2.34.1



[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Kernel Development]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite Info]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Samba]     [Linux Media]     [Device Mapper]

  Powered by Linux