Hello, On Thu, 13 Apr 2023, Abhijeet Rastogi wrote: > Hi Simon, Andrea and Julian, > > I really appreciate you taking the time to respond to my patch. Some follow up > questions that I'll appreciate a response for. > > @Simon Horman > >In any case, I think this patch is an improvement on the current situation. > > +1 to this. I wanted to add that, we're not changing the defaults > here, the default still stays at 2^12. If a kernel user changes the > default, they probably already know what the limitations are, so I > personally don't think it is a big concern. > > @Andrea Claudi > >for the record, RHEL ships with CONFIG_IP_VS_TAB_BITS set to 12 as > default. > > Sorry, I should have been clearer. RHEL ships with the same default, > yes, but it doesn't have the range check, at least, on the version I'm > using right now (3.10.0-1160.62.1.el7.x86_64). > > On this version, I'm able to load with bit size 30, 31 gives me error > regarding allocating memory (64GB host) and anything beyond 31 is > mysteriously switched to a lower number. The following dmesg on my > host confirms that the bitsize 30 worked, which is not possible > without a patch on the current kernel version. > > "[Fri Apr 14 01:14:51 2023] IPVS: Connection hash table configured (size=1073741 > 824, memory=16777216Kbytes)" > > @Julian Anastasov, > >This is not a limit of number of connections. I prefer > not to allow value above 24 without adding checks for the > available memory, > > Interesting that you brought up that number 24, that is exactly what > we use in production today. One IPVS node is able to handle spikes of > 10M active connections without issues. This patch idea originated as > my company is migrating from the ancient RHEL version to a somewhat > newer CentOS (5.* kernel) and noticed that we were unable to load the > ip_vs kernel module with anything greater than 20 bits. Another > motivation for kernel upgrade is utilizing maglev to reduce table size > but that's out of context in this discussion. > > My request is, can we increase the range from 20 to something larger? > If 31 seems a bit excessive, maybe, we can settle for something like > [8,30] or even lower. With conn_tab_bits=30, it allocates 16GB at > initialization time, it is not entirely absurd by today's standards. > > I can revise my patch to a lower range as you guys see fit. Some 32-bit platforms have a 120MB limit for vmalloc. 24-bit table on 32-bit box will allocate 64MB. One way to solve the problem is to use in Kconfig: range 8 20 if !64BIT range 8 27 if 64BIT Why 30 and above do not work? Because we store the size, mask in 'int' which is 32 bits. But also some places do not allow allocations above INT_MAX, for example, kvmalloc_node(). So, even 28 may not work for 8-byte array items on 64-bit. It would be good to check if the provided value does not exceed some real limits. Here is an example that assumes IPVS will allocate up to 1/8 of the memory, 8 conns average in a hash row. Such checks should not exceed the small vmalloc area for 32-bit boxes and also kvmalloc allows vmalloc with huge pages. This idea is entirely untested/compiled. These checks apply some sane thresholds. If you need something above, you are probably allocating more than needed. /* This will match the Kconfig range: */ int min = 8; #if __BITS_PER_LONG > 32 int max = 27; #else int max = 20; #endif We can safely use 27 in Kconfig even for 32-bit due to the below checks, they will clamp it to lower value. /* Order of the available memory */ int max_avail = order_base_2(totalram_pages()) + PAGE_SHIFT; We can remove this 'if' check: if (ip_vs_conn_tab_bits < 8 || ip_vs_conn_tab_bits > 20) { pr_info("conn_tab_bits not in [8, 20]. Using default value\n"); ip_vs_conn_tab_bits = CONFIG_IP_VS_TAB_BITS; } max_avail -= 3; /* ~8 in hash row */ max_avail -= 3; /* IPVS up to 1/8 of mem */ /* The hash table links allocated memory for IPVS conns */ max_avail -= order_base_2(sizeof(struct ip_vs_conn)); /* Range should not exceed the available memory */ max = clamp(max, min, max_avail); /* Clamp configured value silently */ ip_vs_conn_tab_bits = clamp_val(ip_vs_conn_tab_bits, min, max); ip_vs_conn_tab_size = 1 << ip_vs_conn_tab_bits; ip_vs_conn_tab_mask = ip_vs_conn_tab_size - 1; /* Switch to kvmalloc */ ip_vs_conn_tab = kvmalloc_array(ip_vs_conn_tab_size, sizeof(*ip_vs_conn_tab), GFP_KERNEL); and use everywhere kvfree(ip_vs_conn_tab); For 64GB box the calcs should be: max_avail = 36 - 3 - 3 - 9 => ip_vs_conn_tab_bits = 21 Allocated hash table: (2^21)*8=16MB Allocated for IPVS conns (8 cols per row): (2^21)*8*(400..512)=6..8GB which is ~1/8 of 64GB. All memory will be allocated with ~64 conns per row. May be the above calcs can be changed to ~4 cols and 1/2 mem to use 128MB (24 bits instead of 21) for our example: 36 - 2 - 1 - 9 => 24. Possible problems if using large table that is not loaded with enough conns: - walking the table will cost more cycles, for example, ip_vs_random_dropentry() wants to walk part of the table every second. Even normal netns cleanup has to walk it. - cat /proc/net/ip_vs_conn will be slower Regards -- Julian Anastasov <ja@xxxxxx>