	Hello,

On Fri, 9 Sep 2022, Jiri Wiesner wrote:

> On Fri, Sep 09, 2022 at 01:21:05AM +0300, Julian Anastasov wrote:
> > 	It is interesting to know what value to use for
> > IPVS_EST_TICK_CHAINS; it is used for the IPVS_EST_MAX_COUNT
> > calculation. We should determine it from tests once the loops are
> > in final form. Now the limit has increased a little, to 38400.
> > Tomorrow I'll check the patches again for possible problems.
>
> I couldn't wait, so I ran tests on various machines and used the
> sched_switch tracepoint to measure the time needed to process one
> chain. The table contains the median time for processing one chain,
> the maximum time measured, the median divided by the number of CPUs,
> and the time that would be needed to process one chain if there were
> 1024 CPUs of that type in a machine:
>
> NR  CPU                                 Time(ms)  Max(ms)  Time/CPU(ms)  1024 CPUs(ms)
>  48 Intel Xeon CPU E5-2670 v3, 2 nodes     1.220    1.343         0.025         26.027
>  64 Intel Xeon Gold 6326, 2 nodes          0.920    1.494         0.014         14.720
> 192 Intel Xeon Gold 6330H, 4 nodes         3.957    4.153         0.021         21.104
> 256 AMD EPYC 7713, 2 NUMA nodes            3.927    5.464         0.015         15.708
>  80 ARM Neoverse-N1, 1 NUMA node           1.833    2.502         0.023         23.462
> 128 ARM Kunpeng 920, 4 NUMA nodes          3.822    4.635         0.030         30.576
>
> I have to admit I was hoping the current IPVS_EST_CHAIN_DEPTH would
> work on machines with more than 1024 CPUs. If the max time values
> are used, the time needed to process one chain on a 1024-CPU machine
> gets even closer to 40 ms, which it must not reach lest the
> estimates become inaccurate. I also have profiling data, so I intend
> to look at the disassembly of ip_vs_estimation_kthread() to see
> which instructions take the most time. I will take a look at v2 of
> the code on Monday.

	v2 uses find_next_bit via for_each_set_bit, which has a cost.
But we should not be surprised: if 268ms covers 50000 estimators on
104 CPUs (I guess this is also the number of possible CPUs we
actually use), one estimator reads from 104 CPUs in 5.36 microsecs,
so we can conclude the following for 1024 CPUs:

Num Est     104 CPUs    1024 CPUs
=================================
      1       5.36us         53us
      4         21us        211us
     16         86us        845us

	The v2 algorithm allows IPVS_EST_CHAIN_DEPTH to be changed to
a variable which we can determine from the CPU count; more CPUs will
need more threads, and we have CPUs for them:

kd->chain_depth = max(1800 / num_possible_cpus(), 2);

Goals:
- chain time: sub-100 usec (our cond_resched rate)
- tick time: 10% of the max 40ms

CPUs   Depth   est_max_count   Chain Time   Tick Time
======================================================
   4     450         1080000         93us      4453us
  16     112          268800         92us      4433us
 104      17           40800         91us      4374us
1024       2            4800        106us      5066us
4096       2            4800        422us     20265us

Summary:
- For 4096 CPUs we can start 208 kthreads for 1000000 estimators, crazy :)
- 4096 CPUs need to be fast to get below these 20ms, or we should use
  chains of 1 estimator for 2048+ CPUs

	If we somehow track when the stats were last updated, maybe we
can skip estimators that have been idle for some time; this can save
the CPU cycles now spent estimating unused dests (rough sketch below).

	Also, I'm investigating the idea of using
task_rlimit(current, RLIMIT_NPROC) as the kthread limit when the
first service is added, and saving it into ipvs->est_max_threads
(sketch below).
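	Since the tables above pack several assumptions, here is a
back-of-the-envelope model in plain C that reproduces them. Not
kernel code: the 0.0515us per estimator per CPU is simply
5.36us / 104 CPUs from above, and the 48 chains per tick with 50
ticks (2s period / 40ms per tick) are my reading of the current
constants:

#include <stdio.h>

int main(void)
{
	unsigned int cpus[] = { 4, 16, 104, 1024, 4096 };

	for (unsigned int i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
		unsigned int depth = 1800 / cpus[i];

		if (depth < 2)
			depth = 2;	/* the max(..., 2) in kd->chain_depth */

		/* 48 chains per tick, 50 ticks, depth estimators per chain */
		unsigned int est_max_count = 48 * 50 * depth;
		/* measured cost: ~0.0515us per estimator per CPU */
		double chain_us = depth * cpus[i] * 0.0515;

		printf("%4u CPUs: depth %3u, est_max_count %7u, "
		       "chain %3.0fus, tick %5.0fus\n",
		       cpus[i], depth, est_max_count,
		       chain_us, chain_us * 48);
	}
	return 0;
}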
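	For the idle-skip idea, something of this shape may work.
last_touched and IPVS_EST_IDLE are illustrative names, not existing
code:

	/* Packet path, e.g. in ip_vs_in_stats(): stamp the dest cheaply. */
	WRITE_ONCE(dest->stats.last_touched, jiffies);

	/* Estimation kthread, before the expensive SMP counter read: */
	if (time_before(READ_ONCE(dest->stats.last_touched),
			jiffies - IPVS_EST_IDLE))
		continue;	/* idle dest, keep the old rate estimates */

The per-packet store is cheap but not free; if the shared cacheline
write is a concern, the stamp could also be kept per-CPU and checked
with a loop in the kthread.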
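	And the RLIMIT_NPROC idea could look roughly like this,
assuming we are in process context when the first service is added;
est_kt_count is an illustrative per-netns counter of running
estimator kthreads:

	if (!ipvs->est_max_threads)
		ipvs->est_max_threads = task_rlimit(current, RLIMIT_NPROC);

	if (ipvs->est_kt_count >= ipvs->est_max_threads)
		goto no_new_kthread;	/* fill existing kds instead */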
	Another idea is for ip_vs_estimation_kthread() not to change
add_row, and instead for ip_vs_start_estimator() to consider est_row
for the same purpose, but only when kd->est_count becomes large, say
2 * IPVS_EST_TICK_CHAINS * kd->chain_depth. The idea is to fill 2
ticks completely while a small number of estimators are added, and to
prefer est_row once we exceed this threshold, spreading the
estimators over more ticks by honouring the 2-second initial timer.
For example:

	/* choose the row where the new estimator is added */
	if (kd->est_count >= 2 * IPVS_EST_TICK_CHAINS * kd->chain_depth)
		crow = READ_ONCE(kd->est_row);	/* spread over all ticks */
	else
		crow = READ_ONCE(kd->add_row);	/* pack the first 2 ticks */
	crow--;
	...

Regards

--
Julian Anastasov <ja@xxxxxx>