On 08.11.23 06:16, Takashi Sakamoto wrote: > On Tue, Nov 07, 2023 at 03:27:08PM -0600, Mario Limonciello wrote: >> +linux-pci / Bjorn >> On 11/7/2023 06:17, Takashi Sakamoto wrote: >>> >>> Thanks for the report. >>> >>> I apologize for the inconvenience you and your reporter facing, however >>> I can not avoid to say that the problem appears to be specific to the AMD >>> Ryzen machines. It nevertheless from the point of kernel development *afaics* is a kernel regression, as it doesn't matter much if the root of the problem is a hw problem; what matters primarily is that the problem apparently did not happen before the commit mentioned in the subject. Which bears the question: Can the culprit (and others commits that might be depending on it) still be reverted without causing regression itself? Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr If I did something stupid, please tell me, as explained on that page. >> Unfortunately I don't have this 1394 hardware myself. I was just looking at >> another completely unrelated issue on Bugzilla and noticed the report come >> up in my search and wanted to ensure it's on your radar already as the >> author as it's lingered a while. > > It is your misfortune to face this kind of machine trouble. > > In the report[1], Matthias Schrumpf and Mark Broadworth noted to use AMD > Ryzen 7 5800X on B550/X570 chipsets, and insert VT6307 in their PCIe bus. > I guess that the device attends PCI bridge (ASM1083) since VT6307 has PCI > interface only. > > We can see MCE error in another report[2]. Unfortunately, the reporter, > Ian Donnelly, have less suspiction about machine architecture, and never > provides hardware information. But I believe that it comes from AMD Ryzen > machine. I transcribe the error here: > > ``` > [ 0.860834] mce: [Hardware Error]: Machine check events logged > [ 0.860834] microcode: CPU20: patch_level=0x0a201025 > [ 0.860835] microcode: CPU21: patch_level=0x0a201025 > [ 0.860836] microcode: CPU23: patch_level=0x0a201025 > [ 0.860836] microcode: CPU22: patch_level=0x0a201025 > [ 0.860837] mce: [Hardware Error]: CPU 17: Machine Check: 0 Bank 0: bc00080001010135 > [ 0.860845] fbcon: Taking over console > [ 0.860847] mce: [Hardware Error]: TSC 0 ADDR fca000f0 MISC d012000000000000 IPID 1000b000000000 > [ 0.860854] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1696955537 SOCKET 0 APIC b microcode a201025 > [ 0.860860] microcode: CPU0: patch_level=0x0a201025 > [ 0.861676] microcode: Microcode Update Driver: v2.2. > ``` > > Additionally, as I note in the PR[3], I observed cache-coherence failure > over memory dedicated for DMA transmission. The mapping is created by > `dmam_alloc_coherent()` and no need to have extra care such as streaming > API. However, the combination of ASM1083 and VT6307 provides me bogus > values from the memory in AMD Ryzen machine, and I can see no issue in > Intel machines. > > Essentially, the host system reboots when firewire-ohci module in guest > system probes the PCI device for 1394 OHCI hardware provided by PCI > pass-though[4]. > >>> I've already received the similar report[1], and have been >>> investigating it in the last few weeks, then got the insight. Please take >>> a look at my short report about it in PR to Linus for 6.7-rc1: >>> https://lore.kernel.org/lkml/20231105144852.GA165906@workstation.local/ >>> >>> I can confirm that I have been abe to reproduce the problem on AMD Ryzen >>> machine. However, it's important to note that I have not observed the >>> problem on the following systems: >> >> Any chance you (or anyone with the issue) has a serial output available? >> I think it would be really good to look at the circumstances surrounding the >> reboot. >> >>> >>> * Intel machine (Sandy Bridge and Skylake generations) >>> * AMD machines predating Ryzen (Sempron 145) >>> * Machines using different 1394 OHCI hardware from other vendors such as >>> TI >>> * VIA VT6307 connected directly to PCI slot (i.e. without the issued >>> PCIe/PCI bridge) >>> >>> Currently, I have not been able to obtain any useful debug output from >>> the Linux system or any hardware error reports when the system reboots. >>> It seems that the system reboots spontaneously. My assumption at this >>> point is that AMD Ryzen machines detect a specific hardware error >>> triggered by Ryzen machine quirk related to the combination of the Asmedia >>> ASM1083/1085 and VIA VT6306/6307/6308, leading to power reset. >>> >> >> Recent kernels have enabled PCI AER. Could that be factoring in perhaps? > > I ordered equipments for the workflow, and waiting for shipping, since > my motherboard has no interface for serial output. > > (However, I predict that we can no helpful output via the interface.) > >>> I genuinely appreciate your assistance in debugging this elusive >>> hardware issue. If any workaround specific to AMD Ryzen machine quirk is >>> required in PCI driver for 1394 OHCI hardware, I'm willing to apply it. >>> However, it is preferable to figure out the reboot mechanism at first, >>> I think. >> >> Does the BIOS on these machines enable a watchdog timer? If so, I'd suggest >> disabling that for a starting point. > > For consumer use, the machine has no such function, I think. For > your information, this is the machine information I used: > > * Ryzen 5 2400G > * Gigabyte Technology Co., Ltd. AX370-Gaming 5/AX370-Gaming 5 > * BIOS F51h 02/09/2023 > >> How about if you compile as a module and then modprobe.blacklist the module >> on kernel command line and load it later. Can you trigger the fault/reboot >> this way? If so, it at least rules out some conditions that happen during a >> race at boot. > > Nowadays FireWire software stack is optional in the most of > distributions. I can encounter the same issue at deferred probing enough > after booting up, even if the load of system is very low. > >> Looking more closely at the change, I would guess the fault is specifically >> in get_cycle_time(). I can see that the VIA devices do set >> QUIRK_CYCLE_TIMER which will cause additional reads. > > I've already tested with the driver compiled without these codes, but the > system reboots again. > >> Another guesses worth looking at is to see if iommu=pt or amd_iommu=off >> help. >> >> If either of those help it could point at being a problem with >> get_cycle_time() and IOMMU. The older systems you mentioned working >> probably didn't enable IOMMU by default but most AMD Ryzen systems do. > > I already suspect platform IOMMU and kernel implementation, however it > is helpless to disable AMD SVM and IOMMU in BIOS settings. Of course, it > is helpless as well to provide any options to iommu in kernel command line. > > If I had any opportunity to access to AMD machines for enterprise-grade > usage somehow, I would have done it. However, I am a private-time > contributor and what I can access to is the ones for consumer use > without any hardware support for RAS reporting. > > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=217993 > [2] https://bugzilla.kernel.org/show_bug.cgi?id=217994 > [3] https://lore.kernel.org/lkml/20231105144852.GA165906@workstation.local/ > [4] https://lore.kernel.org/lkml/20231016155657.GA7904@workstation.local/ > > Thanks > > Takashi Sakamoto > >