> -----Original Message-----
> From: Volker Schwicking [mailto:volker.schwicking@xxxxxxxxxxx]
> Sent: Thursday, April 26, 2018 8:22 PM
> To: Kashyap Desai
> Cc: Martin K. Petersen; linux-scsi@xxxxxxxxxxxxxxx; Sumit Saxena;
> Shivasharan Srikanteshwara
> Subject: Re: MegaCli fails to communicate with Raid-Controller
>
> On 23. Apr 2018, at 11:03, Volker Schwicking
> <volker.schwicking@xxxxxxxxxxx> wrote:
> >
> > I will add the printk to dma_alloc_coherent() as well to see, which
> > request actually fails. But i have to be a bit patient since its a
> > production system and the customers aren’t to happy about reboots.
>
> Alright, here are some results.
>
> Looking at my debug lines i can tell, that requesting either 2048 or 4
> regularly fail. Other values don’t ever show up as failed, but there are
> several as you can see in the attached log.
>
> The failed requests:
> ###
> $ grep 'GD IOV-len FAILED' /var/log/kern.log | awk '{ print $9, $10 }' | sort | uniq -c
>      59 FAILED: 2048
>      64 FAILED: 4
> ###

Thanks! This helps to understand the problem. A few questions:

- How frequent is this failure? Can you reproduce it on demand?
- Do you see no failures at all on the 4.6 kernel?
- What does your setup look like? Are you running a VM, or does this
  failure occur on the host OS?
- Can you share the full dmesg logs?

>
> I attached full debugging output from several executions of
> “megacli -ldpdinfo -a0” in 5 second intervals, successful and failed and
> content from /proc/buddyinfo again.
>
> Can you make any sense of that? Where should i go from here?

It may be better to find the call trace of dma_alloc_coherent() using
ftrace. Depending on which DMA engine is configured, the failure may be
related to code changes in that DMA engine. Can you get those ftrace logs
as well? You may have to cherry-pick the ftrace filter around
dma_alloc_coherent().

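For the ftrace step, something along these lines might work as a starting point. It is only a rough sketch, not a tested recipe for your system. The tracefs path, the grep pattern, and the helper names (list_dma_candidates, start_dma_trace, stop_dma_trace) are my assumptions. dma_alloc_coherent() itself may be inlined and so not traceable, which is why the filter is picked from whatever the kernel lists as available.

```shell
#!/bin/sh
# Sketch only: needs root and a kernel built with the function tracer.
# Assumed tracefs location; some systems use /sys/kernel/tracing instead.
TRACEFS="${TRACEFS:-/sys/kernel/debug/tracing}"

# List traceable functions whose names look related to DMA allocation.
# The pattern is a guess; adjust it after looking at the real list.
list_dma_candidates() {
    grep -E 'dma_alloc|swiotlb.*alloc' "$TRACEFS/available_filter_functions"
}

# Restrict tracing to the candidates, then record a stack trace per hit
# so the caller of each allocation is visible.
start_dma_trace() {
    # Drop the " [module]" suffix that some entries carry.
    funcs=$(list_dma_candidates | awk '{ print $1 }')
    # Never enable stack traces with an empty filter: that would trace
    # every kernel function and can stall the machine.
    if [ -z "$funcs" ]; then
        echo "no matching functions found" >&2
        return 1
    fi
    echo "$funcs" > "$TRACEFS/set_ftrace_filter"
    echo function > "$TRACEFS/current_tracer"
    echo 1 > "$TRACEFS/options/func_stack_trace"
    echo 1 > "$TRACEFS/tracing_on"
}

# Stop tracing, print what was collected, and restore the defaults.
stop_dma_trace() {
    echo 0 > "$TRACEFS/tracing_on"
    cat "$TRACEFS/trace"
    echo 0 > "$TRACEFS/options/func_stack_trace"
    echo nop > "$TRACEFS/current_tracer"
}
```

The idea would be to call start_dma_trace, run "megacli -ldpdinfo -a0" until one execution fails, and then call stop_dma_trace and save its output together with the kern.log lines from the same time window.
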
I quickly grepped in arch/xen for anything related to memory allocation
and found that pci_xen_swiotlb_detect() has some methods to enable/disable
certain features, and one of the key factors is whether the DMA range is
32-bit or 64-bit. Since the older controller requests DMA buffers in the
region below 4GB, code changes in that area between 4.6 and 4.14.x might be
a possible reason for the frequent memory allocation failures. This is my
wild guess, based on the info that 4.6 is *not at all* exposed to memory
allocation failures at the same frequency as 4.14.

Kashyap