On 10/08/15 18:56, Christoph Biedl wrote:
> Eric Dumazet wrote...
> > [ commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af ]
> >> It definitely should help !
>
> Yesterday, I've experienced issues somewhat similar to this, but I'm
> not entirely sure:
>
> Four of five systems running 4.1.9 stopped working. No reaction on
> network, keyboard, serial console. In one case, the stack trace as
> below made it to the loghost.
>
> Two things are quite different. First, the systems had a reasonable
> uptime, about a week.
>
> And second, the scary part: All incidents happened within a rather
> short time span of three minutes at most, beginning after 16:41:28
> and before 16:41:54 UTC. So I assumed a brownout first - until I
> realized the systems faded away at slightly different times, and one
> is at a different location, while other systems running different
> kernel versions continued to operate at both sites.
>
> So, I'd be glad for answers to:
>
> - Is this the same issue or should I be even more afraid?

There's always room for more. :-)

> - What might be the reason for this temporal coincidence? I have no
>   plausible idea. More bugs?
>
> Confused,
>     Christoph
>
>
> INFO: rcu_sched self-detected stall on CPU { 3} (t=6000 jiffies g=8932806 c=8932805 q=58491)
> rcu_sched kthread starved for 5999 jiffies!
> Task dump for CPU 3:
> swapper/3 R running task 0 0 1 0x00000008
>  ffffffff81e396c0 ffff88042dcc3b20 ffffffff810807da 0000000000000003
>  ffffffff81e396c0 ffff88042dcc3b40 ffffffff81083b78 ffff88042dcc3b80
>  0000000000000003 ffff88042dcc3b70 ffffffff810a945c ffff88042dcd5740
> Call Trace:
>  <IRQ> [<ffffffff810807da>] sched_show_task+0xaa/0x110
>  [<ffffffff81083b78>] dump_cpu_task+0x38/0x40
>  [<ffffffff810a945c>] rcu_dump_cpu_stacks+0x8c/0xc0
>  [<ffffffff810abf31>] rcu_check_callbacks+0x3b1/0x680
>  [<ffffffff810e7bb7>] ? acct_account_cputime+0x17/0x20
>  [<ffffffff8108484e>] ? account_system_time+0x8e/0x180
>  [<ffffffff810ae4d3>] update_process_times+0x33/0x60
>  [<ffffffff810bcae0>] tick_sched_handle.isra.14+0x30/0x40
>  [<ffffffff810bcbd3>] tick_sched_timer+0x43/0x80
>  [<ffffffff810aea2a>] __run_hrtimer.isra.32+0x4a/0xd0
>  [<ffffffff810af225>] hrtimer_interrupt+0xd5/0x1f0
>  [<ffffffff81034d84>] local_apic_timer_interrupt+0x34/0x60
> INFO: rcu_sched self-detected stall on CPU { 3} (t=6000 jiffies g=8932806 c=8932805 q=58491)
> rcu_sched kthread starved for 5999 jiffies!
> Task dump for CPU 3:
> swapper/3 R running task 0 0 1 0x00000008
>  ffffffff81e396c0 ffff88042dcc3b20 ffffffff810807da 0000000000000003
>  ffffffff81e396c0 ffff88042dcc3b40 ffffffff81083b78 ffff88042dcc3b80
>  0000000000000003 ffff88042dcc3b70 ffffffff810a945c ffff88042dcd5740
> Call Trace:
>  <IRQ> [<ffffffff810807da>] sched_show_task+0xaa/0x110
>  [<ffffffff81083b78>] dump_cpu_task+0x38/0x40
>  [<ffffffff8103516c>] smp_apic_timer_interrupt+0x3c/0x60
>  [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
>  [<ffffffff8190c8a9>] ? _raw_spin_unlock_irqrestore+0x9/0x10
>  [<ffffffff810ade58>] try_to_del_timer_sync+0x48/0x60
>  [<ffffffff810adeb2>] ? del_timer_sync+0x42/0x60
>  [<ffffffff810adeba>] del_timer_sync+0x4a/0x60
>  [<ffffffff8178b7da>] inet_csk_reqsk_queue_drop+0x7a/0x1f0
>  [<ffffffff8178ba7f>] reqsk_timer_handler+0x12f/0x290
>  [<ffffffff8178b950>] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
>  [<ffffffff810ad9e6>] call_timer_fn.isra.26+0x26/0x80
>  [<ffffffff810a945c>] rcu_dump_cpu_stacks+0x8c/0xc0
>  [<ffffffff810abf31>] rcu_check_callbacks+0x3b1/0x680
>  [<ffffffff810e7bb7>] ? acct_account_cputime+0x17/0x20
>  [<ffffffff8108484e>] ? account_system_time+0x8e/0x180
>  [<ffffffff810ae4d3>] update_process_times+0x33/0x60
>  [<ffffffff810bcae0>] tick_sched_handle.isra.14+0x30/0x40
>  [<ffffffff810bcbd3>] tick_sched_timer+0x43/0x80
>  [<ffffffff810aea2a>] __run_hrtimer.isra.32+0x4a/0xd0
>  [<ffffffff810af225>] hrtimer_interrupt+0xd5/0x1f0
>  [<ffffffff81034d84>] local_apic_timer_interrupt+0x34/0x60
>  [<ffffffff810ae1ae>] run_timer_softirq+0x18e/0x220
>  [<ffffffff81060b1a>] __do_softirq+0xda/0x1f0
>  [<ffffffff81060e16>] irq_exit+0x76/0xa0
>  [<ffffffff81035175>] smp_apic_timer_interrupt+0x45/0x60
>  [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
>  <EOI> [<ffffffff810844be>] ? sched_clock_cpu+0x9e/0xb0
>  [<ffffffff8100bc15>] ? amd_e400_idle+0x35/0xd0
>  [<ffffffff8100bc13>] ? amd_e400_idle+0x33/0xd0
>  [<ffffffff8100c42a>] arch_cpu_idle+0xa/0x10
>  [<ffffffff810929e3>] cpu_startup_entry+0x2c3/0x330
>  [<ffffffff8103516c>] smp_apic_timer_interrupt+0x3c/0x60
>  [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
>  [<ffffffff8190c8a9>] ? _raw_spin_unlock_irqrestore+0x9/0x10
>  [<ffffffff810ade58>] try_to_del_timer_sync+0x48/0x60
>  [<ffffffff810adeb2>] ? del_timer_sync+0x42/0x60
>  [<ffffffff810adeba>] del_timer_sync+0x4a/0x60
>  [<ffffffff8178b7da>] inet_csk_reqsk_queue_drop+0x7a/0x1f0
>  [<ffffffff8178ba7f>] reqsk_timer_handler+0x12f/0x290
>  [<ffffffff8178b950>] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
>  [<ffffffff810ad9e6>] call_timer_fn.isra.26+0x26/0x80
>  [<ffffffff810332dc>] start_secondary+0x17c/0x1a0
>  [<ffffffff810ae1ae>] run_timer_softirq+0x18e/0x220
>  [<ffffffff81060b1a>] __do_softirq+0xda/0x1f0
>  [<ffffffff81060e16>] irq_exit+0x76/0xa0
>  [<ffffffff81035175>] smp_apic_timer_interrupt+0x45/0x60
>  [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
>  <EOI> [<ffffffff810844be>] ? sched_clock_cpu+0x9e/0xb0
>  [<ffffffff8100bc15>] ? amd_e400_idle+0x35/0xd0
>  [<ffffffff8100bc13>] ? amd_e400_idle+0x33/0xd0
>  [<ffffffff8100c42a>] arch_cpu_idle+0xa/0x10
>  [<ffffffff810929e3>] cpu_startup_entry+0x2c3/0x330
>  [<ffffffff810332dc>] start_secondary+0x17c/0x1a0

The timer fixes were followups to a patch that went into 4.1 called
"tcp/dccp: get rid of central timewait timer", and it seems there were
a few more patches in that area very recently. So after some git
spelunking I am now running with the following patches on top of
4.1.10 + 83fccfc3940.. (for the lockups), in the following order:

fc01538f9fb75572c969ca9988176ffc2a8741d6 simplify timewait refcounting
dbe7faa4045ea83a37b691b12bb02a8f86c2d2e9 inet_twsk_deschedule factorization
29c6852602e259d2c1882f320b29d5c3fec0de04 fix races in reqsk_queue_hash_req()
ed2e923945892a8372ab70d2f61d364b0b6d9054 fix timewait races in timer handling

They may not all be required for the particular problem you just
summoned, but (as far as I could tell) they are needed so that
everything applies cleanly. They certainly can't make things worse. :-)

Oh, and while you're at it you can apply these l33t cubic fixes :-)

30927520dbae297182990bb21d08762bcc35ce1d better follow cubic curve after idle period
c2e7204d180f8efc80f27959ca9cf16fa17f67db do not set epoch_start in the future

I've been running these on 3 machines for almost 10 minutes without
issue, so they are totally safe to go into production right away.

-h
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html