On 1/17/2018 1:22 AM, Dave Young wrote:
> [Modify the subject since this is a new problem, original io vector
> issue has been fixed with one commit from Thomas]
>
> Add more cc according to below old discussion:
> https://lkml.org/lkml/2017/7/27/574
>
> Tom, I'm not sure why you finally did not dynamically run wbinvd?

That discussion was aimed at the wbinvd that was being performed in
arch/x86/kernel/relocate_kernel_64.S, which is dynamically run based on
a flag.

> On 01/04/18 at 11:15am, Dave Young wrote:
>> On 12/14/17 at 05:24pm, Dave Young wrote:
>>> On 12/13/17 at 11:57pm, Yu Chen wrote:
>>>> On Wed, Dec 13, 2017 at 10:52:56AM +0800, Dave Young wrote:
>>>>> Hi,
>>>>>
>>>>> Kexec reboot and kdump have been broken on my laptop for a long
>>>>> time with 4.15.0-rc1+ kernels. With the patch below an early panic
>>>>> has been fixed:
>>>>> https://patchwork.kernel.org/patch/10084289/
>>>>>
>>>>> But I still cannot get a successful reboot. It looked like a
>>>>> graphics issue, but after bisecting the kernel, I got the below:
>>>>>
>>>>> [dyoung@dhcp-*-* linux]$ git bisect good
>>>>> There are only 'skip'ped commits left to test.
>>>>> The first bad commit could be any of:
>>>>> 2db1f959d9dc16035f2eb44ed5fdb2789b754d6a
>>>>> 4900be83602b6be07366d3e69f756c1959f4169a
>>>>> We cannot bisect more!
>>>>>
>>>>> These two commits cannot be reverted because of code conflicts, thus
>>>>> I reverted the whole series from Thomas (below commits), with those
>>>>> x86/vector changes reverted, kexec reboot works fine.
>>>>>
>>>>> Could you help to take a look, any thoughts? I can do the test
>>>>> if you have some debug patch to try.
>>>> Is it possible that the "second" kernel runs on non-zero CPU? If yes,
>>>> what if some irqs are only delivered to cpu0? (use cpumask_of(0)
>>>> directly)
>>>
>>> Thanks for the reply.
>>>
>>> For kdump, yes, for kexec, I'm not sure.
>>>
>>> Here is some kexec kernel boot log:
>>> http://people.redhat.com/~ruyang/misc/kexec-regression.txt
>>>
>>> Copy the lockup call trace here:
>>> [ 23.779285] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
>>> [ 23.779285] Modules linked in: arc4 rtsx_pci_sdmmc i915 iwlmvm kvm_intel mac80211 kvm irqbypass btusb btrtl btbcm intel_gtt btintel drm_kms_helper snd_hda_intel syscopyarea bluetooth iwlwifi snd_hda_codec snd_hwdep snd_hda_core sysfillrect snd_seq sysimgblt input_leds fb_sys_fops e1000e ecdh_generic cfg80211 snd_seq_device drm snd_pcm serio_raw ptp pcspkr thinkpad_acpi i2c_i801 snd_timer rtsx_pci pps_core snd soundcore rfkill video
>>> [ 23.779307] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc3+ #378
>>> [ 23.779308] Hardware name: LENOVO 20ARS1BJ02/20ARS1BJ02, BIOS GJET92WW (2.42 ) 03/03/2017
>>> [ 23.779312] RIP: 0010:poll_idle+0x2f/0x5f
>>> [ 23.779313] RSP: 0018:ffffffff81c03e80 EFLAGS: 00000246
>>> [ 23.779314] RAX: ffffffff81c0f4c0 RBX: ffffffff81c6db80 RCX: 0000000000000000
>>> [ 23.779315] RDX: 0000000000000000 RSI: ffffffff81c6db80 RDI: ffff88021f2201e8
>>> [ 23.779316] RBP: ffff88021f2201e8 R08: 000000349a65b7dd R09: ffff88021f216db4
>>> [ 23.779317] R10: ffffffff81c03e68 R11: 0000000000000000 R12: 0000000000000000
>>> [ 23.779318] R13: ffffffff81c6db98 R14: 0000000000000000 R15: 0000000578a065b1
>>> [ 23.779319] FS: 0000000000000000(0000) GS:ffff88021f200000(0000) knlGS:0000000000000000
>>> [ 23.779320] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 23.779321] CR2: 00007ffed1d0ee60 CR3: 000000021ec0a006 CR4: 00000000001606b0
>>> [ 23.779322] Call Trace:
>>> [ 23.779328] cpuidle_enter_state+0x6a/0x2c0
>>> [ 23.779333] do_idle+0x17b/0x1d0
>>> [ 23.779335] cpu_startup_entry+0x6f/0x80
>>> [ 23.779338] start_kernel+0x431/0x451
>>> [ 23.779342] secondary_startup_64+0xa5/0xb0
>>> [ 23.779344] Code: 00 fb 66 0f 1f 44 00 00 65 48 8b 04 25 40 c4 00 00 f0 80 48 02 20 48 8b 08 83 e1 08 74 0d eb 12 f3 90 65 48 8b 04 25 40 c4 00 00 <48> 8b 00 a8 08 74 ee 65 48 8b 04 25 40 c4 00 00 f0 80 60 02 df
>>>
>>
>> Following up on this issue: it seems another commit from Thomas
>> partially fixed this. kexec/kdump boot up successfully for me, but
>> kexec after kexec (2nd kexec reboot cycle) failed, kernel hung early
>
> The above kexec reboot hang is another problem, so Thomas has fully
> fixed the previous report, thanks!
>
> For the kexec reboot hang, if I remove the wbinvd in stop_this_cpu()
> then kexec works fine, like this:
>
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 832a6acd730f..6d7499730b27 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -380,20 +380,8 @@ void stop_this_cpu(void *dummy)
>  	disable_local_APIC();
>  	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
>
> -	for (;;) {
> -		/*
> -		 * Use wbinvd followed by hlt to stop the processor. This
> -		 * provides support for kexec on a processor that supports
> -		 * SME. With kexec, going from SME inactive to SME active
> -		 * requires clearing cache entries so that addresses without
> -		 * the encryption bit set don't corrupt the same physical
> -		 * address that has the encryption bit set when caches are
> -		 * flushed. To achieve this a wbinvd is performed followed by
> -		 * a hlt. Even if the processor is not in the kexec/SME
> -		 * scenario this only adds a wbinvd to a halting processor.
> -		 */
> -		asm volatile("wbinvd; hlt" : : : "memory");
> -	}
> +	for (;;)
> +		halt();
>  }
>
>  /*
>
> But I have no idea why though, seeking help and thoughts..

Yeah, I don't know why that works either.

Thanks,
Tom

>
>>
>> commit bc976233a872c0f20f018fb1e89264a541584e25
>> Author: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>> Date: Fri Dec 29 10:47:22 2017 +0100
>>
>>     genirq/msi, x86/vector: Prevent reservation mode for non maskable MSI
>>
>> Thanks
>> Dave
>
> Thanks
> Dave
>

_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec