On Fri, Dec 6, 2019 at 6:48 PM Ranran <ranshalit@xxxxxxxxx> wrote:
>
> On Fri, Dec 6, 2019 at 5:08 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> >
> > On Fri, Dec 06, 2019 at 08:09:48AM +0200, Ranran wrote:
> > > On Fri, Nov 29, 2019 at 8:38 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Nov 29, 2019 at 06:10:51PM +0200, Ranran wrote:
> > > > > On Fri, Nov 29, 2019 at 4:58 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > > > > On Fri, Nov 29, 2019 at 06:59:48AM +0000, bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
> > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=205701
> >
> > > I have tried to upgrade to the latest kernel 5.4 (elrepo in CentOS), but
> > > with this processor/board (system x3650, Xeon), it hangs during
> > > kernel boot without any error in dmesg; it just keeps waiting for
> > > nothing for a couple of minutes and then drops to dracut.
> >
> > - I don't think you ever said exactly what the original failure mode
> >   was. You said DMA from an FPGA failed. What is the specific
> >   device? How do you know the DMA fails?
> >
>
> Hi,
> The FPGA is Intel's Arria 10 device.
> We know that the DMA fails because, when using SignalTap to probe the DMA
> transaction from the FPGA to the CPU's RAM, we see that it stalls, i.e. it
> keeps waiting for the access to finish.
> We don't observe any error in dmesg.
>

Two more notes about this:
1. We know that on the same computer (Intel's Xeon, system x3650) the FPGA
   can do the transaction without any issues.
2. Using the exact same test module on an older computer/CPU (Intel's DUO),
   we observe no issues in the DMA transaction from the FPGA.
The DMA transaction is always from the FPGA to the CPU's RAM.

>
> > - Re your v5.4 kernel testing, dracut is a user-space distro thing, so
> >   it sounds like your hang is some sort of installation problem that I
> >   can't really help you with. Maybe there are troubleshooting hints
> >   at https://www.kernel.org/pub/linux/utils/boot/dracut/dracut.html.
>
> I know, that's quite frustrating.
> I tried to disable features using the
> kernel arguments noacpi, noapic, but it still freezes somewhere without
> giving any error.
>
> > You may also be able to just drop a v5.4 kernel on your v4.18
> > system, at least for testing purposes.
> >
> What does it mean to drop a 5.4 kernel on a 4.18 kernel?
> >
> > - Your comment #3 in bugzilla is a link to a Google Doc containing a
> >   test module. In the future, please attach things as plain text
> >   attachments directly to the bugzilla. There's an "Add attachment"
> >   link immediately before the "Description" comment in bugzilla. I
> >   did it for you this time.
> >
> > - It looks like your test_module.c is a kernel module, and frankly
> >   it's a mess. Global variables that should be per-device, unused
> >   variables (dma_get_mask() called for no reason), confused usage
> >   (e.g., using both pci_dev_s and pPciDev), whitespace that appears
> >   random, etc. I suggest starting with Documentation/PCI/pci.rst and,
> >   at least for this debugging effort, making it a self-contained
> >   driver instead of splitting things between a kernel module and
> >   user-space.
> >
>
> I've attached the latest kernel module, which I hope will make things
> clearer. I will try to make it a standalone test next time I'm in the lab.
>
> > - Your comment #4 is a link to a Google Doc containing lspci output.
> >   I attached it to bugzilla directly for you.
> >
> > - You apparently didn't run lspci as root ("sudo lspci -vv"), so it
> >   is missing a lot of information.
> >
> > - Your lspci doesn't match either of the dmesg logs. Please make sure
> >   all your logs are from the same machine in the same configuration.
> >   For example, the first devices found by the kernel (from both
> >   comments #1 and #2) are:
> >
> >     pci 0000:00:00.0: [8086:3c00] type 00 class 0x060000
> >     pci 0000:00:01.0: [8086:3c02] type 01 class 0x060400
> >     pci 0000:00:02.0: [8086:3c04] type 01 class 0x060400
> >     pci 0000:00:02.2: [8086:3c06] type 01 class 0x060400
> >     ...
> >
> >   But the lspci doesn't include 00:01.0, 00:02.0, or 00:02.2. It
> >   shows:
> >
> >     00:00.0 Host bridge: Intel Corporation Device 2020 (rev 04)
> >     00:04.0 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
> >     00:04.1 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
> >     00:04.2 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
> >     ...
>
> I will do it in lab tomorrow.
> Thanks.
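
To check that a dmesg log and an lspci dump come from the same machine before attaching them, something like the following could be used. It is only a minimal sketch: the function names and sample strings are made up for illustration (the sample lines are the ones quoted above), it assumes lspci was run without -D so the PCI domain defaults to 0000, and it compares device addresses only, so it would not notice a device ID that differs at the same address (as with 00:00.0 above).

```python
import re

# dmesg enumeration lines look like:
#   pci 0000:00:01.0: [8086:3c02] type 01 class 0x060400
DMESG_RE = re.compile(
    r"pci ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): \[[0-9a-f]{4}:[0-9a-f]{4}\]"
)
# lspci device lines start with "00:04.0 " or, with -D, "0000:00:04.0 "
LSPCI_RE = re.compile(r"^(?:([0-9a-f]{4}):)?([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")


def dmesg_devices(text):
    """Return the set of PCI addresses enumerated in a dmesg log."""
    return {m.group(1) for m in DMESG_RE.finditer(text)}


def lspci_devices(text):
    """Return the set of PCI addresses listed in lspci output."""
    devices = set()
    for line in text.splitlines():
        m = LSPCI_RE.match(line)
        if m:
            # Assumption: a missing domain means domain 0000.
            devices.add(f"{m.group(1) or '0000'}:{m.group(2)}")
    return devices


def compare_logs(dmesg_text, lspci_text):
    """Return (addresses only in dmesg, addresses only in lspci), sorted."""
    in_dmesg = dmesg_devices(dmesg_text)
    in_lspci = lspci_devices(lspci_text)
    return sorted(in_dmesg - in_lspci), sorted(in_lspci - in_dmesg)


# Sample input: the lines quoted in this thread (illustrative only).
DMESG_SAMPLE = """\
pci 0000:00:00.0: [8086:3c00] type 00 class 0x060000
pci 0000:00:01.0: [8086:3c02] type 01 class 0x060400
pci 0000:00:02.0: [8086:3c04] type 01 class 0x060400
pci 0000:00:02.2: [8086:3c06] type 01 class 0x060400
"""

LSPCI_SAMPLE = """\
00:00.0 Host bridge: Intel Corporation Device 2020 (rev 04)
00:04.0 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.1 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.2 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
"""

if __name__ == "__main__":
    only_dmesg, only_lspci = compare_logs(DMESG_SAMPLE, LSPCI_SAMPLE)
    print("in dmesg but not lspci:", only_dmesg)
    print("in lspci but not dmesg:", only_lspci)
```

If either list is non-empty, the two logs most likely come from different machines or configurations and should be captured again together.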