2013/10/24 Bjorn Helgaas <bhelgaas@xxxxxxxxxx>:
> On Tue, Oct 22, 2013 at 8:02 PM, wubo <wuborush@xxxxxxxxx> wrote:
>> Hi, all
>>
>> Sorry for troubling you.
>> We are developing msix feature on our product, unfortunately it will
>> lead kernel to crash
>> on a server PC whose cpu is Intel(R) Xeon(R) CPU E5645, and we are
>> sure that our driver is good
>> on common personal PC.
>>
>> A piece code in our driver like that:
>> for (i = 0; i < msix_num; i++) {
>>     msix = &pcie->msix_entries[i];
>>     msix->entry = i;
>> }
>> ret = pci_enable_msix(XX);
>> for (i = 0; i < msix_num; i++) {
>>     msix = &pcie->msix_entries[i];
>>     ret = request_irq(msix->vector, XX);
>> }
>>
>> BTW, the kernel crash info is as follows:
>> [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
>> [Hardware Error]: APEI generic hardware error status
>> [Hardware Error]: severity: 1, fatal
>> [Hardware Error]: section: 0, severity: 1, fatal
>> [Hardware Error]: flags: 0x01
>> [Hardware Error]: primary
>> [Hardware Error]: section_type: PCIe error
>> [Hardware Error]: port_type: 0, PCIe end point
>> [Hardware Error]: version: 1.0
>> [Hardware Error]: command: 0x0407, status: 0x0010
>> [Hardware Error]: device_id: 0000:04:00.0
>> [Hardware Error]: slot: 2
>> [Hardware Error]: secondary_bus: 0x00
>> [Hardware Error]: vendor_id: 0x1c5f, device_id: 0x0530
>> [Hardware Error]: class_code: 008001
>>
>> Do I miss something important?? Can anybody give me some hints?
>
> It's very difficult to give any hints based on so little information.
> The error looks like a PCIe hardware issue, which should not
> necessarily cause the kernel itself to crash (and if the kernel *did*
> crash, you didn't include any information about that).

Thank you very much for your reply.
The kernel really did crash; the info above is from kdump.

>
> I don't know how to interpret this APEI error info.  It's possible
> that your BIOS logged it and can give more details.
> The most likely
> problem is that you programmed some incorrect MSI address/data info
> into the device, and when it attempted to signal an MSI, it caused the
> error.  Or it could be a regular device DMA gone awry.

The MSI address/data that the hardware sends seems to be correct; I
have been watching the messages the hardware sends. They look like this:

Message address Read: fee20000  // sent to CPU core 32, without IRQ routing
Message data Read: 40c9         // fixed mode

But I guess that after something like the ACPI translation, the IRQ
becomes an illegal number.

>
> You could compare your driver's MSI handling with other drivers in the
> tree.  You could try to figure out the difference between the "common
> personal PC" (where your driver apparently works) and the server PC
> (where it fails) -- boot the server with a reduced configuration
> (fewer CPUs, fewer other devices, etc.) to make it more like the
> personal PC.

The biggest difference is the APIC mode, which affects the MSI
address/data configuration in pci_enable_msix. But I know very little
about the APIC. I am also sure that INTx interrupts work fine on the
server with my driver.

> You could try using fewer MSI-X IRQs.  You could try
> using MSI or line-based interrupts to make sure it's really an
> MSI-related problem.
>
> Since most drivers do use MSI-X successfully, the problem is likely in
> your driver, not in the Linux PCI code.  I've given you some hints
> above, but in general, people don't have time to help debug
> proprietary, out-of-tree drivers.

Yes, I know. I was just hoping that somebody had happened to come
across this before. Anyway, thanks again for your kindness.

>
> Bjorn

--
Thanks,
Wubo
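
For reference, the two register values quoted in the reply above
(address fee20000, data 40c9) can be decoded by hand. The following is
only a sketch: it assumes the x86 compatibility-format MSI message
layout as I understand it from the Intel SDM (no interrupt remapping),
and the struct and function names are made up for illustration; they
are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type: decoded view of an x86 MSI message.
 * Field positions assume the compatibility (non-remapped) format. */
struct msi_fields {
    unsigned in_msi_window;  /* address bits 31:20 == 0xfee */
    unsigned dest_id;        /* address bits 19:12: destination APIC ID */
    unsigned redir_hint;     /* address bit 3 */
    unsigned dest_mode;      /* address bit 2: 0 = physical, 1 = logical */
    unsigned vector;         /* data bits 7:0 */
    unsigned delivery_mode;  /* data bits 10:8: 0 = fixed */
    unsigned level_assert;   /* data bit 14 */
    unsigned trigger_mode;   /* data bit 15: 0 = edge, 1 = level */
};

/* Split the raw message address/data words into their bit fields. */
struct msi_fields decode_msi(uint32_t addr, uint16_t data)
{
    struct msi_fields f;

    f.in_msi_window = ((addr >> 20) == 0xfee);
    f.dest_id       = (addr >> 12) & 0xff;
    f.redir_hint    = (addr >> 3) & 0x1;
    f.dest_mode     = (addr >> 2) & 0x1;
    f.vector        = data & 0xff;
    f.delivery_mode = (data >> 8) & 0x7;
    f.level_assert  = (data >> 14) & 0x1;
    f.trigger_mode  = (data >> 15) & 0x1;
    return f;
}

/* Print the decoded fields in a form that is easy to compare
 * between the working PC and the failing server. */
void print_msi(const struct msi_fields *f)
{
    printf("msi window: %u, dest id: %u, rh: %u, dm: %u\n",
           f->in_msi_window, f->dest_id, f->redir_hint, f->dest_mode);
    printf("vector: 0x%02x, delivery mode: %u, assert: %u, trigger: %u\n",
           f->vector, f->delivery_mode, f->level_assert, f->trigger_mode);
}
```

Under those assumptions, decode_msi(0xfee20000, 0x40c9) gives
destination ID 32, physical destination mode, vector 0xc9 and fixed
delivery mode, which matches the "core 32" and "fixed mode" notes
above. Running the same decode on the values programmed on the working
PC may show which field differs.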