On 24.11.22 14:00, Alexandra Winter wrote: > > > On 23.11.22 14:41, Tony Lu wrote: >> On Wed, Nov 23, 2022 at 02:13:04PM +0100, Jan Karcher wrote: >>> >>> >>> On 23/11/2022 12:53, Tony Lu wrote: >>>> On Wed, Nov 23, 2022 at 11:49:07AM +0100, Jan Karcher wrote: >>>>> The fixed commit changed the expected behavior of buffersizes >>>>> set by the user using the setsockopt mechanism. >>>>> Before the fixed patch the logic for determining the buffersizes used >>>>> was the following: >>>>> >>>>> default = net.ipv4.tcp_{w|r}mem[1] Jan, you explained to me: "the minima is 16Kib. This is enforced in smc_compress_bufsize which would move any value <= 16Kib into bucket 0 - which is 16KiB " net.ipv4.tcp_wmem[1] defaults to 8Kib. So in the default case (unchanged net.ipv4.tcp_wmem[1]) the default for the send path is not net.ipv4.tcp_wmem[1]. Should be clarified here. >>>>> sockopt = the setsockopt mechanism >>>>> val = the value assigned in default or via setsockopt >>>>> sk_buf = short for sk_{snd|rcv}buf >>>>> real_buf = the real size of the buffer (sk_buf_size in __smc_buf_create) >>>>> >>>>> exposed | net/core/sock.c | af_smc.c | smc_core.c >>>>> | | | >>>>> +---------+ | | +------------+ | +-------------------+ >>>>> | default |----------------------| sk_buf=val |---| real_buf=sk_buf/2 | >>>>> +---------+ | | +------------+ | +-------------------+ >>>>> | | | ^ >>>>> | | | | >>>>> +---------+ | +--------------+ | | | >>>>> | sockopt |---| sk_buf=val*2 |-----------------------| >>>>> +---------+ | +--------------+ | | >>>>> | | | >>>>> >>>>> The fixed patch introduced a dedicated sysctl for smc >>>>> and removed the /2 in smc_core.c resulting in the following flow: >>>>> >>>>> default = net.smc.{w|r}mem (which defaults to net.ipv4.tcp_{w|r}mem[1]) >>>>> sockopt = the setsockopt mechanism >>>>> val = the value assigned in default or via setsockopt >>>>> sk_buf = short for sk_{snd|rcv}buf >>>>> real_buf = the real size of the buffer (sk_buf_size in __smc_buf_create) >>>>> >>>>> exposed | net/core/sock.c | af_smc.c | smc_core.c >>>>> | | | >>>>> +---------+ | | +------------+ | +-----------------+ >>>>> | default |----------------------| sk_buf=val |---| real_buf=sk_buf | >>>>> +---------+ | | +------------+ | +-----------------+ >>>>> | | | ^ >>>>> | | | | >>>>> +---------+ | +--------------+ | | | >>>>> | sockopt |---| sk_buf=val*2 |-----------------------| >>>>> +---------+ | +--------------+ | | >>>>> | | | >>>>> >>>>> This would result in double of memory used for existing configurations >>>>> that are using setsockopt. >>>> >>>> Firstly, thanks for your detailed diagrams :-) >>>> >>>> And the original decision to use user-provided values rather than >>>> value/2 to follow the instructions of the socket manual [1]. >>>> >>>> SO_RCVBUF >>>> Sets or gets the maximum socket receive buffer in bytes. >>>> The kernel doubles this value (to allow space for >>>> bookkeeping overhead) when it is set using setsockopt(2), >>>> and this doubled value is returned by getsockopt(2). The >>>> default value is set by the >>>> /proc/sys/net/core/rmem_default file, and the maximum >>>> allowed value is set by the /proc/sys/net/core/rmem_max >>>> file. The minimum (doubled) value for this option is 256. >>>> >>>> [1] https://man7.org/linux/man-pages/man7/socket.7.html >>>> >>>> The user of SMC should know that setsockopt() with SO_{RCV|SND}BUF will >>> >>> I totally agree that an educated user of SMC should know about that behavior >>> if they decide to use it. >>> We do provide our users preload libraries where they can pass preferred >>> buffersizes via arguments and we handle the Sockopts for them. >>> >>>> double the values in kernel, and getsockopt() will return the doubled >>>> values. So that they should use half of the values which are passed to >>>> setsockopt(). The original patch tries to make things easier in SMC and >>>> let user-space to handle them following the socket manual. >>>> >>>>> SMC historically decided to use the explicit value given by the user >>>>> to allocate the memory. This is why we used the /2 in smc_core.c. >>>>> That logic was not applied to the default value. >>>> >>>> Yep, let back to the patch which introduced smc_{w|r}mem knobs, it's a >>>> trade-off to follow original logic of SMC, or follow the socket manual. >>>> We decides to follow the instruction of manuals in the end. >>> >>> I understand the point. I spend a lot of time trying to decide what to do. >>> >>> Since it was an intentional decision to not follow the general socket >>> option, and we do not have anyone complaining we do not really have a reason >>> to change it. >>> Changing it means that users with existing configurations would have to >>> change their configs on an update or suddenly expect double the memory >>> consumption. >>> That's why we in the end preffered to stay with the current logic. >> >> I can't agree with you more with the points to follow the historic logic >> and not break the user-space applications. >> >>> I'm thinking that maybe - if we stay with the historic logic - we should >>> document that desicion somewhere. So that in the future, if a user that >>> expects the man page behavior, has a way to understand what SMC is doing. >>> What do oyu think? >> >> Yep, we _really_ need to document it if we change the convention. >> Actually, I spent a lot of time to find the history about the logic of >> buffer (/2 and *2) in SMC. So I'm really in favor of adding >> documentation, at least code comments to help others to understand them. >> >> Cheers, >> Tony Lu > Iiuc you are changing the default values in this a patch and your other patch: > Default values for real_buf for send and receive: > > before 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") > real_buf=net.ipv4.tcp_{w|r}mem[1]/2 send: 8k recv: 64k see above: send: 16k recv: 64k > > after 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") > real_buf=net.ipv4.tcp_{w|r}mem[1] send: 16k (16*1024) recv: 128k (131072) > > after net/smc: Fix expected buffersizes and sync logic > real_buf=net.ipv4.tcp_{w|r}mem[1] send: 16k (16*1024) recv: 128k (131072) > > after net/smc: Unbind smc control from tcp control > real_buf=SMC_*BUF_INIT_SIZE send: 16k (16384) recv: 64k (65536) > > If my understanding is correct, then I nack this. > Defaults should be restored to the values before 0227f058aa29. > Otherwise users will notice a change in memory usage that needs to > be avoided or announced more explicitely. (and don't change them twice) See above, I stand corrected. However this patch fixes/restores the buffersize for setsockopt, but not for the default recieve path. Could you please clarify that in the title and description? Reviewed-by: Alexandra Winter <wintera@xxxxxxxxxxxxx> >> >>> - Jan >>> >>>> >>>> Cheers, >>>> Tony Lu >>>> >>>>> Since we now have our own sysctl, which is also exposed to the user, >>>>> we should sync the logic in a way that both values are the real value >>>>> used by our code and shown by smc_stats. To achieve this this patch >>>>> changes the behavior to: >>>>> >>>>> default = net.smc.{w|r}mem (which defaults to net.ipv4.tcp_{w|r}mem[1]) >>>>> sockopt = the setsockopt mechanism >>>>> val = the value assigned in default or via setsockopt >>>>> sk_buf = short for sk_{snd|rcv}buf >>>>> real_buf = the real size of the buffer (sk_buf_size in __smc_buf_create) >>>>> >>>>> exposed | net/core/sock.c | af_smc.c | smc_core.c >>>>> | | | >>>>> +---------+ | | +-------------+ | +-----------------+ >>>>> | default |----------------------| sk_buf=val*2|---|real_buf=sk_buf/2| >>>>> +---------+ | | +-------------+ | +-----------------+ >>>>> | | | ^ >>>>> | | | | >>>>> +---------+ | +--------------+ | | | >>>>> | sockopt |---| sk_buf=val*2 |------------------------| >>>>> +---------+ | +--------------+ | | >>>>> | | | >>>>> >>>>> This way both paths follow the same pattern and the expected behavior >>>>> is re-established. >>>>> >>>>> Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") >>>>> Signed-off-by: Jan Karcher <jaka@xxxxxxxxxxxxx> >>>>> Reviewed-by: Wenjia Zhang <wenjia@xxxxxxxxxxxxx> >>>>> --- >>>>> net/smc/af_smc.c | 9 +++++++-- >>>>> net/smc/smc_core.c | 8 ++++---- >>>>> 2 files changed, 11 insertions(+), 6 deletions(-) >>>>> >>>>> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c >>>>> index 036532cf39aa..a8c84e7bac99 100644 >>>>> --- a/net/smc/af_smc.c >>>>> +++ b/net/smc/af_smc.c >>>>> @@ -366,6 +366,7 @@ static void smc_destruct(struct sock *sk) >>>>> static struct sock *smc_sock_alloc(struct net *net, struct socket *sock, >>>>> int protocol) >>>>> { >>>>> + int buffersize_without_overhead; >>>>> struct smc_sock *smc; >>>>> struct proto *prot; >>>>> struct sock *sk; >>>>> @@ -379,8 +380,12 @@ static struct sock *smc_sock_alloc(struct net *net, struct socket *sock, >>>>> sk->sk_state = SMC_INIT; >>>>> sk->sk_destruct = smc_destruct; >>>>> sk->sk_protocol = protocol; >>>>> - WRITE_ONCE(sk->sk_sndbuf, READ_ONCE(net->smc.sysctl_wmem)); >>>>> - WRITE_ONCE(sk->sk_rcvbuf, READ_ONCE(net->smc.sysctl_rmem)); >>>>> + buffersize_without_overhead = >>>>> + min_t(int, READ_ONCE(net->smc.sysctl_wmem), INT_MAX / 2); >>>>> + WRITE_ONCE(sk->sk_sndbuf, buffersize_without_overhead * 2); >>>>> + buffersize_without_overhead = >>>>> + min_t(int, READ_ONCE(net->smc.sysctl_rmem), INT_MAX / 2); >>>>> + WRITE_ONCE(sk->sk_rcvbuf, buffersize_without_overhead * 2); >>>>> smc = smc_sk(sk); >>>>> INIT_WORK(&smc->tcp_listen_work, smc_tcp_listen_work); >>>>> INIT_WORK(&smc->connect_work, smc_connect_work); >>>>> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c >>>>> index 00fb352c2765..36850a2ae167 100644 >>>>> --- a/net/smc/smc_core.c >>>>> +++ b/net/smc/smc_core.c >>>>> @@ -2314,10 +2314,10 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb) >>>>> if (is_rmb) >>>>> /* use socket recv buffer size (w/o overhead) as start value */ >>>>> - sk_buf_size = smc->sk.sk_rcvbuf; >>>>> + sk_buf_size = smc->sk.sk_rcvbuf / 2; >>>>> else >>>>> /* use socket send buffer size (w/o overhead) as start value */ >>>>> - sk_buf_size = smc->sk.sk_sndbuf; >>>>> + sk_buf_size = smc->sk.sk_sndbuf / 2; >>>>> for (bufsize_short = smc_compress_bufsize(sk_buf_size, is_smcd, is_rmb); >>>>> bufsize_short >= 0; bufsize_short--) { >>>>> @@ -2376,7 +2376,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb) >>>>> if (is_rmb) { >>>>> conn->rmb_desc = buf_desc; >>>>> conn->rmbe_size_short = bufsize_short; >>>>> - smc->sk.sk_rcvbuf = bufsize; >>>>> + smc->sk.sk_rcvbuf = bufsize * 2; >>>>> atomic_set(&conn->bytes_to_rcv, 0); >>>>> conn->rmbe_update_limit = >>>>> smc_rmb_wnd_update_limit(buf_desc->len); >>>>> @@ -2384,7 +2384,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb) >>>>> smc_ism_set_conn(conn); /* map RMB/smcd_dev to conn */ >>>>> } else { >>>>> conn->sndbuf_desc = buf_desc; >>>>> - smc->sk.sk_sndbuf = bufsize; >>>>> + smc->sk.sk_sndbuf = bufsize * 2; >>>>> atomic_set(&conn->sndbuf_space, bufsize); >>>>> } >>>>> return 0; >>>>> -- >>>>> 2.34.1