Hello, On Thu, 3 Dec 2020, dust.li wrote: > Hi Yunhong & Julian, any updates ? > > > We've encountered the same problem. With lots of ipvs > > services plus many CPUs, it's easy to reproduce this issue. > > I have a simple script to reproduce: > > First add many ipvs services: > > for((i=0;i<50000;i++)); do > ipvsadm -A -t 10.10.10.10:$((2000+$i)) > done > > > Then, check the latency of estimation_timer() using bpftrace: > > #!/usr/bin/bpftrace > > kprobe:estimation_timer { > @enter = nsecs; > } > > kretprobe:estimation_timer { > $exit = nsecs; > printf("latency: %ld us\n", (nsecs - @enter)/1000); > } > > I observed about 268ms delay on my 104 CPUs test server. > > Attaching 2 probes... > latency: 268807 us > latency: 268519 us > latency: 269263 us > > > And I tried moving estimation_timer() into a delayed > > workqueue, this do make things better. But since the > > estimation won't give up CPU, it can run for pretty > > long without scheduling on a server which don't have > > preempt enabled, so tasks on that CPU can't get executed > > during that period. > > > Since the estimation repeated every 2s, we can't call > > cond_resched() to give up CPU in the middle of iterating the > > est_list, or the estimation will be quite inaccurate. > > Besides the est_list needs to be protected. > > > I haven't found any ideal solution yet, currently, we just > > moved the estimation into kworker and add sysctl to allow > > us to disable the estimation, since we don't need the > > estimation anyway. > > > Our patches is pretty simple now, if you think it's useful, > > I can paste them > > > Do you guys have any suggestions or solutions ? When I saw the report first time, I thought on this issue and here is what I think: - single delayed work (slow to stop them if using many) - the delayed work walks say 64 lists with estimators and reschedules itself for the next 30ms, as result, 30ms*64=1920ms, 80ms will be idle up to the 2000ms period for estimation for every list. As result, every list is walked once per 2000ms. If 30ms is odd for all kind of HZ values, this can be 20ms * 100. - work will use spin_lock_bh(&s->lock) to protect the entries, we do not want delays between /proc readers and the work if using mutex. But _bh locks stop timers and networking for short time :( Not sure yet if just spin_lock is safe for both /proc and estimator's work. - while walking one of the 64 lists we should use just rcu read locking for the current list, no cond_resched_rcu because we do not know how to synchronize while entries are removed at the same time. For now using array with 64 lists solves this problem. - the algorith will look like this: int row = 0; for () { rcu_read_lock(); list_for_each_entry_rcu(e, &ipvs->est_list[row], list) { ... /* Should we stop immediately ? */ if (!ipvs->enable || stopping delayed work) { rcu_read_unlock(); goto out; } } /* rcu_read_unlock will reschedule if needed * but we know where to start from the next time, * i.e. from next slot */ rcu_read_unlock(); reschedule delayed work for +30ms or +110ms if last row by using queue_delayed_work*() row ++; if (row >= 64) row = 0; } out: - the drawback is that without cond_resched_rcu we are not fair to other processes, solution is needed, we just reduce the problem by using 64 lists which can be imbalanced. - next goal: entries should be removed with list_del_rcu, without any locks, we do not want to delay configurations, ipvs->est_lock should be removed. - now the problem is how to spread the entries to 64 lists. One way is randomly, in this case first estimation may be for less than 2000ms. In this case, less entries means less work for all 64 steps. But what if entries are removed and some slots become empty and others too large? - if we want to make the work equal for all 64 passes, we can rebalance the lists, say on every 2000ms move some entries to neighbour list. As result, one entry can be estimated after 1970ms or 2030ms. But this is complication not possible with the RCU locking, we need a safe way to move entries to neighbour list, say if work walks row 0 we can rebalance between rows 32 and 33 which are 1 second away of row 0. And not all list primitives allow it for _rcu. - next options is to insert entries in some current list, if their count reaches, say 128, then move to the next list for inserting. This option tries to provide exact 2000ms delay for the first estimation for the newly added entry. We can start with some implementation and see if your tests are happy. Regards -- Julian Anastasov <ja@xxxxxx>