Re: [Bug 205701] New: Can't access RAM from PCIe

On Fri, Dec 6, 2019 at 6:48 PM Ranran <ranshalit@xxxxxxxxx> wrote:
>
> On Fri, Dec 6, 2019 at 5:08 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> >
> > On Fri, Dec 06, 2019 at 08:09:48AM +0200, Ranran wrote:
> > > On Fri, Nov 29, 2019 at 8:38 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Nov 29, 2019 at 06:10:51PM +0200, Ranran wrote:
> > > > > On Fri, Nov 29, 2019 at 4:58 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > > > > On Fri, Nov 29, 2019 at 06:59:48AM +0000, bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
> > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=205701
> >
> > > I have tried to upgrade to the latest kernel, 5.4 (ELRepo on CentOS),
> > > but with this processor/board (System x3650, Xeon) it hangs during
> > > kernel boot without any error in dmesg; it just keeps waiting for
> > > a couple of minutes and then drops to dracut.
> >
> > - I don't think you ever said exactly what the original failure mode
> >   was.  You said DMA from an FPGA failed.  What is the specific
> >   device?  How do you know the DMA fails?
> >
>
> Hi,
> The FPGA is an Intel Arria 10 device.
> We know that the DMA fails because, when probing the DMA transaction
> from the FPGA to the CPU's RAM with SignalTap, we see that it stalls,
> i.e. it keeps waiting for the access to finish.
> We don't observe any error in dmesg.
>

Two more notes about this:
1. We know that on the same computer (Intel Xeon, System x3650) the
FPGA can do the transaction without any issues.
2. Using the exact same test module on an older computer/CPU (Intel's
DUO), we observe no issues in the DMA transaction from the FPGA.
The DMA transaction is always from the FPGA to the CPU's RAM.


>
> > - Re your v5.4 kernel testing, dracut is a user-space distro thing, so
> >   it sounds like your hang is some sort of installation problem that I
> >   can't really help you with.  Maybe there are troubleshooting hints
> >   at https://www.kernel.org/pub/linux/utils/boot/dracut/dracut.html.
>
> I know, that's quite frustrating. I tried to disable features using
> the kernel arguments noacpi and noapic, but it still freezes
> somewhere without giving any error.
>
> >   You may also be able to just drop a v5.4 kernel on your v4.18
> >   system, at least for testing purposes.
> >
> What does it mean to drop a 5.4 kernel on a 4.18 kernel?
>
>
> > - Your comment #3 in bugzilla is a link to a Google Doc containing a
> >   test module.  In the future, please attach things as plain text
> >   attachments directly to the bugzilla.  There's an "Add attachment"
> >   link immediately before the "Description" comment in bugzilla.  I
> >   did it for you this time.
> >
> > - It looks like your test_module.c is a kernel module, and frankly
> >   it's a mess.  Global variables that should be per-device, unused
> >   variables (dma_get_mask() called for no reason), confused usage
> >   (e.g., using both pci_dev_s and pPciDev), whitespace that appears
> >   random, etc.  I suggest starting with Documentation/PCI/pci.rst and,
> >   at least for this debugging effort, making it a self-contained
> >   driver instead of splitting things between a kernel module and
> >   user-space.
> >
>
> I've attached the latest kernel module, which I hope will make it
> clearer. I will try to make it a standalone test next time I'm in
> the lab.
>
> > - Your comment #4 is a link to a Google Doc containing lspci output.
> >   I attached it to bugzilla directly for you.
> >
> > - You apparently didn't run lspci as root ("sudo lspci -vv"), so it
> >   is missing a lot of information.
> >
> > - Your lspci doesn't match either of the dmesg logs.  Please make sure
> >   all your logs are from the same machine in the same configuration.
> >   For example, the first devices found by the kernel (from both
> >   comments #1 and #2) are:
> >
> >     pci 0000:00:00.0: [8086:3c00] type 00 class 0x060000
> >     pci 0000:00:01.0: [8086:3c02] type 01 class 0x060400
> >     pci 0000:00:02.0: [8086:3c04] type 01 class 0x060400
> >     pci 0000:00:02.2: [8086:3c06] type 01 class 0x060400
> >     ...
> >
> >   But the lspci doesn't include 00:01.0, 00:02.0, or 00:02.2.  It
> >   shows:
> >
> >     00:00.0 Host bridge: Intel Corporation Device 2020 (rev 04)
> >     00:04.0 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
> >     00:04.1 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
> >     00:04.2 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
> >     ...
>
> I will do it in the lab tomorrow. Thanks.
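
For reference, here is a minimal sketch of how the lspci/dmesg mismatch
pointed out above could be checked mechanically before attaching logs.
It assumes the dmesg and lspci output have been saved as text. The
function names are hypothetical, and it compares only
bus:device.function addresses, ignoring the PCI domain. The sample
lines are the ones quoted in this thread.

```python
import re

# Kernel enumeration lines look like:
#   pci 0000:00:01.0: [8086:3c02] type 01 class 0x060400
DMESG_RE = re.compile(
    r"pci [0-9a-f]{4}:([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"\[[0-9a-f]{4}:[0-9a-f]{4}\]"
)

# The first line of each lspci entry looks like:
#   00:04.0 System peripheral: ...
# An optional "0000:" domain prefix is accepted and ignored.
LSPCI_RE = re.compile(
    r"^(?:[0-9a-f]{4}:)?([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ", re.M
)


def dmesg_devices(text):
    """Return the set of bus:dev.fn addresses enumerated in a dmesg log."""
    return set(DMESG_RE.findall(text))


def lspci_devices(text):
    """Return the set of bus:dev.fn addresses listed in lspci output."""
    return set(LSPCI_RE.findall(text))


def missing_from_lspci(dmesg_text, lspci_text):
    """Devices the kernel enumerated that the lspci output does not list."""
    return sorted(dmesg_devices(dmesg_text) - lspci_devices(lspci_text))


DMESG_SAMPLE = (
    "pci 0000:00:00.0: [8086:3c00] type 00 class 0x060000\n"
    "pci 0000:00:01.0: [8086:3c02] type 01 class 0x060400\n"
    "pci 0000:00:02.0: [8086:3c04] type 01 class 0x060400\n"
    "pci 0000:00:02.2: [8086:3c06] type 01 class 0x060400\n"
)

LSPCI_SAMPLE = (
    "00:00.0 Host bridge: Intel Corporation Device 2020 (rev 04)\n"
    "00:04.0 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)\n"
    "00:04.1 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)\n"
    "00:04.2 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)\n"
)

if __name__ == "__main__":
    # prints ['00:01.0', '00:02.0', '00:02.2']
    print(missing_from_lspci(DMESG_SAMPLE, LSPCI_SAMPLE))
```

A non-empty result suggests the logs were taken from different
machines or configurations, which is the situation described above.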


