On 13.02.2017 01:38, Borislav Petkov wrote:
On Sun, Feb 12, 2017 at 11:21:13PM +0100, Gabriel C wrote:
http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_initcall_debug.mp4
http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_intcall_debug_ucode_off.mp4
Thanks and interesting. In both cases, mcheck_init_device() doesn't
return or we don't see the "initcall returned" message.
Ok, let's try a silly sprinkling of printks in that function and try to
pinpoint how far we manage to come.
Apply, build, boot and shoot video again :-)
I didn't tested your patch yet but did a boot with mce=off and nomce which seems to not
really works since is still want to mc_device_add() even when off.
See :
http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_mce_off.jpg
I'll build an .10-rc8 with your patch tomorrow .. is somewhat late now here :)
Another thing is .. there seems to be a real bug in tsc code .
I've build an -rc8 with a lot more debug options on an now I see the following :
...
[ 4.321029] =================================
[ 4.321909] [ INFO: inconsistent lock state ]
[ 4.322789] 4.10.0-rc8-debug #1 Tainted: G I
[ 4.323879] ---------------------------------
[ 4.324759] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 4.325973] cpuhp/0/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 4.326993] (tick_broadcast_lock){?.....}, at: [<ffffffff810f40e7>] tick_broadcast_control+0x57/0x190
[ 4.328879] {IN-HARDIRQ-W} state was registered at:
[ 4.329866] __lock_acquire+0x24f/0x19e0
[ 4.330675] lock_acquire+0xa5/0xd0
[ 4.331399] _raw_spin_lock_irqsave+0x54/0x90
[ 4.332297] tick_broadcast_switch_to_oneshot+0x11/0x50
[ 4.333371] tick_switch_to_oneshot+0x8c/0xd0
[ 4.334269] tick_init_highres+0x10/0x20
[ 4.335079] hrtimer_run_queues+0x5a/0xe0
[ 4.335907] run_local_timers+0x20/0x50
[ 4.336699] update_process_times+0x22/0x50
[ 4.337562] tick_periodic+0xa5/0xb0
[ 4.338302] tick_handle_periodic+0x1f/0x60
[ 4.378065] smp_trace_apic_timer_interrupt+0x74/0x90
[ 4.418107] smp_apic_timer_interrupt+0x9/0x10
[ 4.458095] apic_timer_interrupt+0x93/0xa0
[ 4.498048] mwait_idle+0x5a/0x90
[ 4.537618] arch_cpu_idle+0xa/0x10
[ 4.577098] default_idle_call+0x2c/0x30
[ 4.616211] do_idle+0x10c/0x1e0
[ 4.654606] cpu_startup_entry+0x5d/0x60
[ 4.692388] rest_init+0x12c/0x140
[ 4.729557] start_kernel+0x45f/0x46c
[ 4.766325] x86_64_start_reservations+0x2a/0x2c
[ 4.803075] x86_64_start_kernel+0xeb/0xf8
[ 4.839178] verify_cpu+0x0/0xfc
[ 4.874629] irq event stamp: 71
[ 4.909417] hardirqs last enabled at (71): [<ffffffff8176a887>] _raw_spin_unlock_irq+0x27/0x50
[ 4.945642] hardirqs last disabled at (70): [<ffffffff81762eaa>] __schedule+0x13a/0x7c0
[ 4.981797] softirqs last enabled at (0): [<ffffffff81060ef0>] copy_process+0x7c0/0x1ea0
[ 5.018580] softirqs last disabled at (0): [< (null)>] (null)
[ 5.055677]
other info that might help us debug this:
[ 5.072455] tsc: Refined TSC clocksource calibration: 2266.746 MHz
[ 5.072467] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x20ac7f6ecc6, max_idle_ns: 440795315461 ns
[ 5.202828] Possible unsafe locking scenario:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This seems to be the place where the other patch breaks hell here..
[ 5.276704] CPU0
[ 5.312400] ----
[ 5.347605] lock(tick_broadcast_lock);
[ 5.383163] <Interrupt>
[ 5.418457] lock(tick_broadcast_lock);
[ 5.454015]
*** DEADLOCK ***
[ 5.557982] no locks held by cpuhp/0/14.
[ 5.592295]
stack backtrace:
[ 5.657946] CPU: 0 PID: 14 Comm: cpuhp/0 Tainted: G I 4.10.0-rc8-debug #1
[ 5.690740] Hardware name: FUJITSU PRIMERGY TX200 S5 /D2709, BIOS 6.00 Rev. 1.14.2709 02/04/2013
[ 5.758323] Call Trace:
[ 5.791434] dump_stack+0x86/0xc1
[ 5.824421] print_usage_bug+0x283/0x2a0
[ 5.857357] mark_lock+0x39e/0x650
[ 5.890256] ? check_usage_forwards+0xf0/0xf0
[ 5.923436] __lock_acquire+0x2ba/0x19e0
[ 5.956617] ? pick_next_task_fair+0x350/0x700
[ 5.989903] ? finish_task_switch+0x184/0x220
[ 6.023171] ? debug_smp_processor_id+0x17/0x20
[ 6.056667] lock_acquire+0xa5/0xd0
[ 6.089882] ? tick_broadcast_control+0x57/0x190
[ 6.123395] ? smpboot_thread_fn+0x28/0x250
[ 6.156838] _raw_spin_lock+0x3c/0x80
[ 6.190175] ? tick_broadcast_control+0x57/0x190
[ 6.223914] tick_broadcast_control+0x57/0x190
[ 6.257846] ? finish_task_switch+0x184/0x220
[ 6.291900] ? smpboot_thread_fn+0x28/0x250
[ 6.325991] intel_idle_cpu_online+0x1d/0x100
[ 6.360220] cpuhp_invoke_callback+0x62/0x120
[ 6.394397] ? smpboot_thread_fn+0x28/0x250
[ 6.428451] cpuhp_thread_fun+0x87/0x110
[ 6.462611] smpboot_thread_fn+0x227/0x250
[ 6.496805] kthread+0x125/0x130
[ 6.530947] ? sort_range+0x20/0x20
[ 6.564126] ? kthread_stop+0x70/0x70
[ 6.597078] ret_from_fork+0x31/0x40
[ 6.630787] clocksource: Switched to clocksource tsc
...
Maybe Thomas has some ideas ?
Regards,
Gabriel C
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html