	Hello,

On Fri, 12 Aug 2022, Jiri Wiesner wrote:

> The calculation of rate estimates for IPVS services and destinations will
> cause an increase in scheduling latency to hundreds of milliseconds when
> the number of estimators reaches tens of thousands or more. This issue has
> been reported upstream [1]. Design changes to the algorithm to compute the
> estimates were proposed in the same email thread.
>
> By implementing some of the proposed design changes, this patch seeks to
> address the latency issue by dividing the estimators into groups for which
> estimates are calculated in a 2-second interval (same as before). Each of
> the groups is processed once in each 2-second interval. Instead of
> allocating an array of lists, groups are identified by their group_id,
> which has the advantage that estimators can stay in the same list to which
> they have been added by ip_vs_start_estimator(). The implementation of
> estimator grouping is able to scale up with an increasing number of
> estimators as well as scale down when estimators are being removed.
> The changes to group size can be monitored with dynamic debugging:
>  echo 'file net/netfilter/ipvs/ip_vs_est.c +pfl' >> /sys/kernel/debug/dynamic_debug/control
>
> Rebalancing of estimator groups is implemented and can be triggered only
> after all the calculations for a 2-second interval have finished. After a
> limit is exceeded, adding or removing estimators will trigger rebalancing,
> which will cause estimates to be inaccurate in the next 2-second interval.
> For example, removing estimators that results in the removal of an entire
> group will shorten the time interval used for computing rates, which will
> lead to the rates being underestimated in the next 2-second interval.
>
> Testing was carried out on a 2-socket machine with Intel Xeon Gold 6326
> CPUs (64 logical CPUs). Tests with up to 600,000 estimators were
> successfully completed. The expectation is that, given the current default
> limits, the implementation can handle 150,000 estimators on most machines
> in use today. In a test with 100,000 estimators, the default group size of
> 1024 estimators resulted in the processing time for one group to be circa
> 2.3 milliseconds and a timer period of 5 jiffies. Despite estimators being
> added or removed throughout most of the test, the overhead of
> ip_vs_estimator_rebalance() was less than 10% of the overhead of
> estimation_timer():
>      7.66%  124093  swapper  [kernel.kallsyms]  [k] intel_idle
>      2.86%   14296  ipvsadm  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
>      2.64%   16827  ipvsadm  [kernel.kallsyms]  [k] ip_vs_genl_parse_service
>      2.15%   18457  ipvsadm  libc-2.31.so       [.] _dl_addr
>      2.08%    4562  ipvsadm  [kernel.kallsyms]  [k] ip_vs_genl_dump_services
>      2.06%   18326  ipvsadm  ld-2.31.so         [.] do_lookup_x
>      1.78%   17251  swapper  [kernel.kallsyms]  [k] estimation_timer
>      ...
>      0.14%     855  swapper  [kernel.kallsyms]  [k] ip_vs_estimator_rebalance
>
> The intention is to develop this RFC patch into a short series addressing
> the design changes proposed in [1]. Also, after moving the rate estimation
> out of softirq context, the whole estimator list could be processed
> concurrently - more than one work item would be used.
>
> [1] https://lore.kernel.org/netdev/D25792C1-1B89-45DE-9F10-EC350DC04ADC@xxxxxxxxx
>
> Signed-off-by: Jiri Wiesner <jwiesner@xxxxxxx>

	Other developers have tried solutions with workqueues, but so far
we have not seen any results. Give me a few days, maybe I can come up
with a solution that uses kthread(s), which would allow nice/cpumask
tuning later and avoid overloading the system workqueues.

Regards

--
Julian Anastasov <ja@xxxxxx>
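
P.S. A side note on the numbers quoted above: the reported timer period of 5 jiffies for 100,000 estimators is what one would get from spreading the groups evenly over the 2-second interval, if HZ=250 is assumed (the HZ value is not stated in the description). Below is a minimal userspace sketch of that arithmetic. It is not code from the patch; all names (sketch_*, SKETCH_*) are hypothetical and the clamp to one jiffy is an assumption.

```c
#include <assert.h>

/* Assumptions: HZ=250 is only a common kernel config value; the
 * 2-second interval and the group size of 1024 are the figures
 * quoted in the patch description. */
#define SKETCH_HZ          250
#define SKETCH_INTERVAL_S  2
#define SKETCH_GROUP_SIZE  1024

/* Number of groups needed to hold n_est estimators, rounded up. */
static unsigned long sketch_group_count(unsigned long n_est,
                                        unsigned long group_size)
{
	assert(group_size > 0);
	return (n_est + group_size - 1) / group_size;
}

/* Timer period in jiffies such that every group is processed once
 * per interval. With no groups, fall back to the full interval. */
static unsigned long sketch_timer_period(unsigned long n_groups)
{
	unsigned long interval = SKETCH_INTERVAL_S * SKETCH_HZ;
	unsigned long period;

	if (n_groups == 0)
		return interval;
	period = interval / n_groups;
	return period ? period : 1;	/* assumed floor of one jiffy */
}
```

With these assumptions, sketch_group_count(100000, SKETCH_GROUP_SIZE) gives 98 groups and sketch_timer_period(98) gives 5 jiffies, i.e. one group roughly every 20 ms. At circa 2.3 ms of processing per group, that would keep one CPU busy in the timer for a bit more than a tenth of the time.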