From: Ramesh Thomas <ramesh.thomas@xxxxxxxxx>

Hello,

This addresses an issue we have been facing with preempt_rt kernels not
being able to enter nohz_full state consistently. Below are my debug
findings and details of a tool I developed that can help reproduce the
issue. The following patch has a proposed fix, or at least a pointer to
areas worth looking into.

We need preempt_rt for determinism, and being able to use nohz_full along
with it is very valuable.

Problem:
Sometimes nohz_full state is never entered even when all necessary
conditions are met. The issue is easier to reproduce on a preempt_rt
kernel, but it may not be limited to preempt_rt.

Debug findings and proposed fix:
In the failure condition, entry into nohz_full state is repeatedly aborted
because tick_nohz_next_event() detects a pending timer event in the next
period. The issue is not reproducible if tick stoppage is not aborted at
this point. The proposed fix skips the bail-out code only if
CONFIG_NO_HZ_FULL is defined.

Since in nohz_full mode the idle state is not entered when ticks are being
stopped, aborting tick stoppage may not be necessary. It is simpler to let
the common code that reprograms the timer in tick_nohz_stop_tick() handle
the next tick.

Environment to reproduce:
Used Intel NUCs with 4 cores (Apollo Lake and Tiger Lake). The issue is
easier to reproduce on embedded platforms.

Kernel version: 5.10.1-rt19
(This is not a new issue; I have seen it in rt kernel version 4.17.)

Relevant kernel config flags:
- NO_HZ_FULL=y
- PREEMPT_RT=y
- CPU_ISOLATION=y
- RCU_NOCB_CPU=y

Relevant kernel boot parameters:
- isolcpus=nohz,domain,1,3 nohz_full=1,3 rcu_nocbs=1,3 irqaffinity=0
- cpufreq.off=1 idle=poll cpuidle.off=1

Steps to reproduce:
1. Disable rt throttling by assigning 100% of the scheduler period to rt
   tasks
2. Set cpu affinity to one of the nohz_full cpus
3. Set scheduling policy to sched_fifo with max priority
4. Wait till "tick_stopped" gets set in /proc/timer_list
5. Return failure if tick is not stopped in 15 seconds
6. Run the above steps in a loop to stress it. It may take a while to
   reproduce.

The above can be done using a tool I developed as part of a framework to
assist in setting up CPU thread isolation and measuring jitter. It can be
found at https://github.com/intel/tif

Build the tif_test app and run it using the included script, which runs
the test in a loop until the failure is reproduced:

    make test
    ./tif_stress.sh

Example output:

    Test# 818
    Successfully entered nohz state in 724us
    Error entering nohz state after 15000102us
    Reproduced NOHZ_FULL failure after 818 tries!!!
    Test elapsed time: 835 seconds

(PS: The framework has a workaround for the issue, which is not used in
the test. The workaround that helped was switching CPU affinity out of and
back into the nohz_full CPU, giving the scheduler a fresh start.)

Ramesh Thomas (1):
  dynticks/preempt_rt: Fix a nohz_full entry failure in preempt_rt

 kernel/time/tick-sched.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

-- 
2.26.2
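
For readers who want to check steps 4 and 5 of "Steps to reproduce" without
building tif, the following is a minimal Python sketch of the polling part
only. It is not the tif implementation. The function names (tick_stopped,
read_timer_list, wait_for_tick_stopped) are hypothetical, and the parser
assumes that /proc/timer_list prints a "cpu: N" header line followed later
by a ".tick_stopped : V" line for each CPU. Reading /proc/timer_list
normally requires root.

```python
import re
import time


def tick_stopped(timer_list_text, cpu):
    """Return True/False for the CPU's .tick_stopped value, or None if absent.

    Assumes each per-CPU section starts with a "cpu: N" line and contains
    a ".tick_stopped : V" line (assumed layout, not verified against tif).
    """
    current = None
    for line in timer_list_text.splitlines():
        header = re.match(r"\s*cpu:\s*(\d+)\s*$", line)
        if header:
            current = int(header.group(1))
            continue
        field = re.match(r"\s*\.tick_stopped\s*:\s*(\d+)\s*$", line)
        if field and current == cpu:
            return field.group(1) != "0"
    return None


def read_timer_list(path="/proc/timer_list"):
    """Read the whole timer list; usually needs root privileges."""
    with open(path) as f:
        return f.read()


def wait_for_tick_stopped(cpu, timeout_s=15.0, poll_s=0.001,
                          read=read_timer_list, sleep=time.sleep):
    """Poll until the tick is stopped on `cpu`.

    Returns the elapsed time in seconds on success, or None if the tick
    was not stopped within `timeout_s` (step 5 above).
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if tick_stopped(read(), cpu):
            return time.monotonic() - start
        sleep(poll_s)
    return None
```

A caller would first pin itself to a nohz_full CPU and switch to sched_fifo
(steps 2 and 3), then call wait_for_tick_stopped(1) and treat a None result
as the failure described above. The `read` and `sleep` parameters exist only
so the polling logic can be exercised without access to /proc.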