Hi Yunhong & Julian, any updates?
We've encountered the same problem. With lots of ipvs
services and many CPUs, this issue is easy to reproduce.
I have a simple script to reproduce it.
First, add many ipvs services:
for ((i = 0; i < 50000; i++)); do
    ipvsadm -A -t 10.10.10.10:$((2000 + i))
done
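The loop above forks ipvsadm once per service. A possible alternative, sketched here under the assumption that ipvsadm's restore mode (`ipvsadm -R`, reading rules from stdin) accepts one `-A -t addr:port` rule per line in the same form that ipvsadm-save emits, is to generate the rules first and load them in one go. `gen_services` is a hypothetical helper name, not part of ipvsadm:

```shell
# Print one "add TCP service" rule per line, for ports 2000 .. 2000+N-1.
# gen_services is a hypothetical helper; the rule format is assumed to match
# what ipvsadm-save produces and what "ipvsadm -R" reads back.
gen_services() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "-A -t 10.10.10.10:$((2000 + i))"
        i=$((i + 1))
    done
}
```

If the assumption holds, `gen_services 50000 | ipvsadm -R` would create the same services as the loop, and `ipvsadm -C` clears the whole table after the test.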
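Then check the latency of estimation_timer() using bpftrace:
#!/usr/bin/bpftrace
// Record the timestamp when estimation_timer() is entered.
kprobe:estimation_timer {
    @enter = nsecs;
}
// On return, print the elapsed time in microseconds.
kretprobe:estimation_timer /@enter/ {
    $exit = nsecs;
    printf("latency: %ld us\n", ($exit - @enter) / 1000);
}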
I observed a delay of about 268 ms on my 104-CPU test server:
Attaching 2 probes...
latency: 268807 us
latency: 268519 us
latency: 269263 us
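As a rough sanity check on these numbers, the sketch below divides the measured latency by the number of (service, CPU) pairs. It assumes the run time is dominated by one per-CPU counter visit per service on each timer run, which matches the "many services plus many CPUs" observation but is not measured directly here:

```shell
# Back-of-the-envelope estimate; the one-visit-per-(service, CPU) cost model
# is an assumption, not something the trace proves.
services=50000
cpus=104
latency_us=268807   # first sample from the bpftrace output above

visits=$((services * cpus))
ns_per_visit=$((latency_us * 1000 / visits))
echo "per-CPU stat visits per timer run: $visits"
echo "approx. cost per visit: $ns_per_visit ns"
```

This prints 5200000 visits and roughly 51 ns per visit, so the total is consistent with a plain walk over every service on every CPU rather than with a single slow operation.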
I also tried moving estimation_timer() into a delayed
workqueue, and this does make things better. But since the
estimation won't give up the CPU, it can run for a pretty
long time without scheduling on a server that doesn't have
preemption enabled, so other tasks on that CPU can't run
during that period.
Since the estimation repeats every 2s, we can't call
cond_resched() to give up the CPU in the middle of iterating
the est_list, or the estimation will become quite inaccurate.
Besides, the est_list needs to be protected.
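To illustrate the accuracy concern, here is a minimal sketch, assuming the estimator converts a counter delta into a rate with a fixed-point shift that hard-codes the 2 s interval (rate = delta << 9, i.e. scaled by 2^10 and divided by 2). That is my reading of the update in ip_vs_est.c, so treat the exact shift as an assumption. `sample_rate` is a hypothetical helper used only for this illustration:

```shell
# Fixed-point rate computation that assumes exactly 2 s between samples.
# sample_rate is a hypothetical helper; the shift amounts are an assumption
# about how the IPVS estimator scales its rates.
sample_rate() {
    # $1 = true connections per second
    # $2 = seconds that actually elapsed since the previous sample
    delta=$(( $1 * $2 ))
    echo $(( (delta << 9) >> 10 ))   # undo the 2^10 scaling to get conns/s
}

on_time=$(sample_rate 1000 2)   # entry visited exactly 2 s after its last sample
late=$(sample_rate 1000 3)      # entry visited 1 s late, e.g. after yielding the CPU
echo "on time: $on_time conns/s, late: $late conns/s"
```

With a steady 1000 conns/s, the on-time sample reports 1000 while the late sample reports 1500, because the extra elapsed time is not part of the formula. A 1 s delay is exaggerated for clarity; any delay introduced in the middle of the list skews the entries after it in the same way.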
I haven't found an ideal solution yet. Currently, we have
just moved the estimation into a kworker and added a sysctl
that allows us to disable the estimation, since we don't
need it anyway.
Our patches are pretty simple now; if you think they're
useful, I can paste them.
Do you guys have any suggestions or solutions?
Thanks a lot!
Dust
On 4/18/20 12:56 AM, yunhong-cgl jiang wrote:
Thanks for the reply.
Yes, our patch changes the est_list to an RCU list. We will do more testing and send out the patch.
Thanks
—Yunhong
On Apr 17, 2020, at 12:47 AM, Julian Anastasov <ja@xxxxxx> wrote:
Hello,
On Thu, 16 Apr 2020, yunhong-cgl jiang wrote:
Hi, Simon & Julian,
We noticed that on our Kubernetes node utilizing IPVS, estimation_timer() takes a very long time (>200 ms, as shown below). Such a long delay in the timer softirq causes long packet latency.
<idle>-0 [007] dNH. 25652945.670814: softirq_raise: vec=1 [action=TIMER]
.....
<idle>-0 [007] .Ns. 25652945.992273: softirq_exit: vec=1 [action=TIMER]
The long latency is caused by the large number of services (>50k) and the large number of CPUs (>80).
We tried moving the timer function into a kernel thread so that it does not block the system, and this seems to solve our problem. Is this the right direction? If yes, we will do more testing and send out an RFC patch. If not, can you give us some suggestions?
Using a kernel thread is a good idea. For this to work, we can
also remove the est_lock and use RCU for est_list.
The writers ip_vs_start_estimator() and ip_vs_stop_estimator() already
run under the common mutex __ip_vs_mutex, so they do not need any
synchronization. We need _bh lock usage in estimation_timer().
Let me know if you need any help with the patch.
Regards
--
Julian Anastasov <ja@xxxxxx>