On Wed, Jul 24, 2024 at 8:04 AM <jiang.kun2@xxxxxxxxxx> wrote:
>
> From: Fan Yu <fan.yu9@xxxxxxxxxx>
>
> The Importance of Following IANA Standards
> ========================================
> IANA specifies User ports as 1024-49151, and it just so happens
> that my application uses port 33060 (reserved for MySQL Database Extended),
> which conflicts with the Linux default dynamic port range (32768-60999)[1].
>
> In fact, IANA assigns numbers in the port range from 32768 to 49151,
> which is uniformly accepted by the industry. To do this,
> it is necessary for the kernel to follow the IANA specification.
>
> Drawbacks of existing implementations
> ========================================
> In past discussions, following the IANA specification by modifying the
> system defaults has been discouraged, since it would greatly affect
> existing users[2].
>
> Theoretically, this can be done by tuning net.ipv4.local_port_range,
> but there are inconveniences such as:
> (1) For cloud-native scenarios, each container is expected to follow
> the IANA specification uniformly, so it is necessary to do sysctl
> configuration in each container individually, which increases the user's
> resource management costs.
> (2) For new applications, since sysctl(net.ipv4.local_port_range) is
> isolated across namespaces, the container cannot inherit the host's value,
> so after startup, it remains at the kernel default value of 32768-60999,
> which reduces the ease of use of the system.
>
> Solution
> ========================================
> In order to maintain compatibility, we provide a sysctl interface in
> host namespace, which makes it easy to tune local port range to
> IANA specification.
>
> When ip_local_port_range_use_iana=1, the local port range of all network
> namespaces is tuned to IANA specification (49152-60999), and IANA
> specification is also used for newly created network namespaces.
> Therefore, each container does not need to do sysctl settings separately,
> which improves the convenience of configuration.
> When ip_local_port_range_use_iana=0, the local port range of all network
> namespaces is tuned to the original kernel defaults (32768-60999).
> For example:
> # cat /proc/sys/net/ipv4/ip_local_port_range
> 32768 60999
> # echo 1 > /proc/sys/net/ipv4/ip_local_port_range_use_iana
> # cat /proc/sys/net/ipv4/ip_local_port_range
> 49152 60999
>
> # unshare -n
> # cat /proc/sys/net/ipv4/ip_local_port_range
> 49152 60999
>
> Notes
> ========================================
> The lower limit (49152) is consistent with the IANA dynamic port lower limit.
> The upper limit (60999) differs from the IANA dynamic upper limit
> because Linux uses 61000-65535 for masquerading/NAT,
> but this does not conflict with the IANA specification[3].
>
> Note that following the above specification reduces the number of ephemeral
> ports by half, increasing the risk of port exhaustion[2].
>
> [1]:https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.txt
> [2]:https://lore.kernel.org/all/bf42f6fd-cd06-02d6-d7b6-233a0602c437@xxxxxxxxx/
> [3]:https://lore.kernel.org/all/20070512210830.514c7709@xxxxxxxxxxxxxxxxx/
>
> Co-developed-by: Kun Jiang <jiang.kun2@xxxxxxxxxx>
> Signed-off-by: Fan Yu <fan.yu9@xxxxxxxxxx>
> Signed-off-by: Kun Jiang <jiang.kun2@xxxxxxxxxx>
> Reviewed-by: xu xin <xu.xin16@xxxxxxxxxx>
> Reviewed-by: Yunkai Zhang <zhang.yunkai@xxxxxxxxxx>
> Reviewed-by: Qiang Tu <tu.qiang35@xxxxxxxxxx>
> Reviewed-by: Peilin He <he.peilin@xxxxxxxxxx>
> Cc: Yang Yang <yang.yang29@xxxxxxxxxx>
> ---
>  Documentation/networking/ip-sysctl.rst | 13 +++++++++++++
>  net/ipv4/af_inet.c                     |  7 ++++++-
>  net/ipv4/sysctl_net_ipv4.c             | 31 +++++++++++++++++++++++++++++++
>  3 files changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index bd50df6a5a42..27f4928c2a1d 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -1320,6 +1320,19 @@ ip_local_port_range - 2 INTEGERS
>  	Must be greater than or equal to ip_unprivileged_port_start.
>  	The default values are 32768 and 60999 respectively.
>
> +ip_local_port_range_use_iana - BOOLEAN
> +	Tune ip_local_port_range to IANA specification easily.
> +	When ip_local_port_range_use_iana=1, the local port range of
> +	all network namespaces is tuned to IANA specification (49152-60999),
> +	and IANA specification is also used for newly created network namespaces.
> +	Therefore, each container does not need to do sysctl settings separately,
> +	which improves the convenience of configuration.
> +	When ip_local_port_range_use_iana=0, the local port range of
> +	all network namespaces are tuned to the original kernel
> +	defaults (32768-60999).
> +

IANA means : Internet Assigned Numbers Authority

It is very possible a future RFC changes the actual ranges.
I would have used rfc 6335, because when a new rfc comes in 2030, we
will have to add a new sysctl, right ?

> +	Default: 0
> +
>  ip_local_reserved_ports - list of comma separated ranges
>  	Specify the ports which are reserved for known third-party
>  	applications. These ports will not be used by automatic port
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index b24d74616637..42b6bc58dc45 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -123,6 +123,8 @@
>
>  #include <trace/events/sock.h>
>
> +extern u8 sysctl_ip_local_port_range_use_iana;
> +
>  /* The inetsw table contains everything that inet_create needs to
>   * build a new socket.
>   */
> @@ -1802,7 +1804,10 @@ static __net_init int inet_init_net(struct net *net)
>  	/*
>  	 * Set defaults for local port range
>  	 */
> -	net->ipv4.ip_local_ports.range = 60999u << 16 | 32768u;
> +	if (sysctl_ip_local_port_range_use_iana)
> +		net->ipv4.ip_local_ports.range = 60999u << 16 | 49152u;
> +	else
> +		net->ipv4.ip_local_ports.range = 60999u << 16 | 32768u;
>
>  	seqlock_init(&net->ipv4.ping_group_range.lock);
>  	/*
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 162a0a3b6ba5..a38447889072 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -45,6 +45,8 @@ static unsigned int tcp_child_ehash_entries_max = 16 * 1024 * 1024;
>  static unsigned int udp_child_hash_entries_max = UDP_HTABLE_SIZE_MAX;
>  static int tcp_plb_max_rounds = 31;
>  static int tcp_plb_max_cong_thresh = 256;
> +u8 sysctl_ip_local_port_range_use_iana;
> +EXPORT_SYMBOL(sysctl_ip_local_port_range_use_iana);
>
>  /* obsolete */
>  static int sysctl_tcp_low_latency __read_mostly;
> @@ -95,6 +97,26 @@ static int ipv4_local_port_range(struct ctl_table *table, int write,
>  	return ret;
>  }
>
> +static int ipv4_local_port_range_use_iana(struct ctl_table *table, int write,
> +					  void *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	struct net *net;
> +	int ret;
> +
> +	ret = proc_dou8vec_minmax(table, write, buffer, lenp, ppos);
> +
> +	if (write && ret == 0) {
> +		for_each_net(net) {

This is quite buggy.

for_each_net() can only be used with care, otherwise list can be
corrupted, netns can disappear under you.