Dear kernelnewbies, In one of the dedicated servers that our company has for shared web hosting we have had a series of log lines: Jun 3 00:33:53 hostname kernel: INFO: rcu_sched self-detected stall on CPU { 0} (t=15001 jiffies g=66854079 c=66854078 q=686736) Jun 3 00:33:53 hostname kernel: INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 1, t=15002 jiffies, g=66854079, c=66854078, q=686736) The lines appeared every 3 minutes from 00:33:54 to 05:12:33 with the last line being: Jun 3 05:12:33 ns25 kernel: INFO: rcu_sched self-detected stall on CPU { 0} (t=4155459 jiffies g=66854079 c=66854078 q=10628601) Jun 3 05:12:33 ns25 kernel: INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 1, t=4155460 jiffies, g=66854079, c=66854078, q=10628601) with all of them there are traces of CPU states with the kernel function trace, CPU registers, etc. Some more info about this: - The problem ended silently as it started, no server restart took place, uptime still shows plenty of time. - The value in t=nnnnnn seems to be the time elapsed in this state in Jiffies. 4155460 = 16621 sg = 4h 36min, roughly the time difference between 05:12 and 00:33 - Kernel 3.10 from our datacenter. Dumping /proc/config.gz I obtain the following info: # RCU Subsystem CONFIG_TREE_RCU=y # CONFIG_PREEMPT_RCU is not set CONFIG_RCU_STALL_COMMON=y # CONFIG_RCU_USER_QS is not set CONFIG_RCU_FANOUT=64 CONFIG_RCU_FANOUT_LEAF=16 # CONFIG_RCU_FANOUT_EXACT is not set # CONFIG_RCU_FAST_NO_HZ is not set # CONFIG_TREE_RCU_TRACE is not set # CONFIG_RCU_NOCB_CPU is not set # RCU Debugging # CONFIG_SPARSE_RCU_POINTER is not set CONFIG_RCU_CPU_STALL_TIMEOUT=60 CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y # CONFIG_CGROUP_SCHED is not set # CONFIG_SCHED_AUTOGROUP is not set CONFIG_DEFAULT_IOSCHED="deadline" CONFIG_SCHED_OMIT_FRAME_POINTER=y CONFIG_SCHED_SMT=y CONFIG_SCHED_MC=y CONFIG_SCHED_HRTICK=y CONFIG_NET_SCHED=y # Queueing/Scheduling # CONFIG_USB_EHCI_TT_NEWSCHED is not set CONFIG_PREEMPT_NOTIFIERS=y CONFIG_PREEMPT_NONE=y # CONFIG_PREEMPT_VOLUNTARY is not set # CONFIG_PREEMPT is not set My questions are ---------------- 1. Am I right assuming that the whole CPU was stalled for all that period? We have proof that the machine was still working (log info among it), but maybe it was only working with one core. 2. The task that was running on the stalled core, was it trapped without running the whole time? With the log lines above I have always had this line: kernel: CPU: 0 PID: 1288 Comm: php-cgi Not tainted 3.10.23-xxxx-std-ipv6-64 #1 3. Can there be other tasks trapped? AFAIK modern Linux 3.10 does not keep separate run-queues for each core, right? 4. What are the meaning of the t=, g= and c= values in the rcu_sched log? t=4155459 jiffies g=66854079 c=66854078 q=10628601 AFAICT, t= represents the time elapsed since this situation started. g= and c= are constant values and q= is an increasing figure. 5. Any ideas of where can this come from? Here is the trace if it helps: [<ffffffff8160806c>] __const_udelay+0x2c/0x30 [<ffffffff8106338a>] arch_trigger_all_cpu_backtrace+0x6a/0xa0 [<ffffffff81115cdd>] rcu_check_callbacks+0x2ed/0x560 [<ffffffff81117c17>] ? acct_account_cputime+0x17/0x20 [<ffffffff810e97df>] ? account_system_time+0xcf/0x180 [<ffffffff810ca2c3>] update_process_times+0x43/0x80 [<ffffffff810f95c1>] tick_sched_handle.isra.12+0x31/0x40 [<ffffffff810f9704>] tick_sched_timer+0x44/0x70 [<ffffffff810df07a>] __run_hrtimer.isra.29+0x4a/0xd0 [<ffffffff810df9b5>] hrtimer_interrupt+0xf5/0x230 [<ffffffff810626c4>] smp_apic_timer_interrupt+0x64/0xa0 [<ffffffff81d421ca>] apic_timer_interrupt+0x6a/0x70 <EOI> [<ffffffff81183e80>] ? switch_names+0xc0/0xc0 [<ffffffff81d409ee>] ? _raw_spin_lock+0xe/0x30 [<ffffffff81184eb3>] shrink_dentry_list+0x43/0x1f0 [<ffffffff811854b7>] prune_dcache_sb+0x137/0x160 [<ffffffff81173ab8>] prune_super+0x168/0x1a0 [<ffffffff81131119>] shrink_slab+0x129/0x1e0 [<ffffffff811331f8>] ? shrink_zone+0xa8/0xd0 [<ffffffff811337af>] do_try_to_free_pages+0x31f/0x410 [<ffffffff811339dd>] try_to_free_pages+0x9d/0xc0 [<ffffffff8112b96e>] __alloc_pages_nodemask+0x4de/0x840 [<ffffffff81159560>] alloc_pages_vma+0xa0/0x1a0 [<ffffffff81142b67>] handle_pte_fault+0x7b7/0x9f0 [<ffffffff81146906>] ? vma_gap_update+0x26/0x30 [<ffffffff8113e293>] ? anon_vma_interval_tree_insert+0x83/0x90 [<ffffffff81143047>] handle_mm_fault+0x147/0x1e0 [<ffffffff8106c391>] __do_page_fault+0x151/0x4b0 [<ffffffff811478ff>] ? vma_merge+0x1ff/0x320 [<ffffffff81148a4b>] ? do_brk+0x1bb/0x320 [<ffffffff8106c719>] do_page_fault+0x9/0x10 [<ffffffff81d41172>] page_fault+0x22/0x30 Any ideas will be welcome. _______________________________________________ Kernelnewbies mailing list Kernelnewbies@xxxxxxxxxxxxxxxxx http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies