On Fri, Dec 06, 2019 at 06:48:24PM +0200, Ranran wrote:
> On Fri, Dec 6, 2019 at 5:08 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > On Fri, Dec 06, 2019 at 08:09:48AM +0200, Ranran wrote:
> > > On Fri, Nov 29, 2019 at 8:38 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > > On Fri, Nov 29, 2019 at 06:10:51PM +0200, Ranran wrote:
> > > > > On Fri, Nov 29, 2019 at 4:58 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > > > > On Fri, Nov 29, 2019 at 06:59:48AM +0000, bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
> > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=205701
> >
> > > I have tried to upgrade to latest kernel 5.4 (elrepo in centos), but
> > > with this processor/board (system x3650, Xeon), it get hang during
> > > kernel boot, without any error in dmesg, just keeps waiting for
> > > nothing for couple of minutes and than drops to dracut.
> >
> > - I don't think you ever said exactly what the original failure mode
> >   was.  You said DMA from an FPGA failed.  What is the specific
> >   device?  How do you know the DMA fails?
>
> FPGA is Intel's Arria 10 device.

I really meant which bus/device/function it is so we can correlate it
with the dmesg log and lspci output.

> We know that DMA fails because on using signaltap/probing the DMA
> transaction from FPGA to CPU's RAM we see that it stall, i.e. keep
> waiting for the access to finish.
> We don't observe any error in dmesg.

I'm not familiar with Signal Tap, but Google suggests that it's
basically an embedded logic analyzer on the FPGA itself.  So I assume
that:

  - On the working system (Intel DUO?) Signal Tap shows the PCIe
    Memory Read TLP from the FPGA and the matching Completion.

  - On the non-working system Signal Tap shows the PCIe Memory Read
    TLP from the FPGA but the Completion never arrives.  I assume the
    FPGA eventually logs a Completion Timeout error?

My guess would be something's wrong with the address the FPGA is
generating.
So please collect the complete dmesg log and /proc/iomem contents and
the address used in the FPGA DMA TLP from both the working and
non-working systems.  There should be some clue if we look at the
differences between the systems.

> > You may also be able to just drop a v5.4 kernel on your v4.18
> > system, at least for testing purposes.
> >
> What does it mean to drop 5.4 kernel on 4.18 kernel ?

Not on a v4.18 *kernel*; on the CentOS *file system* that was
installed along with your v4.18-based kernel.  If you take a v5.4
kernel built with the right config options/modules/etc, it should work
on the same root filesystem as the v4.18 kernel.

Bjorn
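
One way to act on the /proc/iomem request above is to check which
region the address in the FPGA's DMA TLP falls in on each system.  The
Python sketch below is illustrative only.  The helper names
(parse_iomem, regions_containing, is_system_ram) and the SAMPLE text
are hypothetical and are not taken from either machine.  It also
assumes the address in the TLP can be compared directly with the
physical addresses in /proc/iomem, which may not hold if an IOMMU is
translating DMA addresses.

```python
import re

# Matches /proc/iomem-style lines such as "  d0000000-d0ffffff : 0000:03:00.0".
_IOMEM_LINE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")

# Made-up sample for illustration; real contents come from /proc/iomem.
SAMPLE = """\
00001000-0009ffff : System RAM
000a0000-000bffff : PCI Bus 0000:00
00100000-bfffffff : System RAM
  01000000-01c031d0 : Kernel code
c0000000-dfffffff : PCI Bus 0000:00
  d0000000-d0ffffff : 0000:03:00.0
100000000-23fffffff : System RAM
"""


def parse_iomem(text):
    """Return a list of (start, end, name) tuples, one per region line."""
    regions = []
    for line in text.splitlines():
        m = _IOMEM_LINE.match(line)
        if m is None:
            continue  # skip anything that is not a region line
        _indent, start, end, name = m.groups()
        regions.append((int(start, 16), int(end, 16), name.strip()))
    return regions


def regions_containing(regions, addr):
    """Return every region (outer and nested) whose range includes addr."""
    return [r for r in regions if r[0] <= addr <= r[1]]


def is_system_ram(regions, addr):
    """True if addr lies inside a region labelled 'System RAM'."""
    return any(name == "System RAM"
               for _s, _e, name in regions_containing(regions, addr))
```

On a real system the text would be read from /proc/iomem as root,
since unprivileged reads typically show the address ranges as zeros.
Running the same check against the output from the working and the
non-working machine would show whether the DMA target lands in RAM on
one but not the other.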