RE: MegaCli fails to communicate with Raid-Controller

Kashyap Desai <kashyap.desai@xxxxxxxxxxxx> · Fri, 27 Apr 2018 12:07:40 +0530

> -----Original Message-----
> From: Volker Schwicking [mailto:volker.schwicking@xxxxxxxxxxx]
> Sent: Thursday, April 26, 2018 8:22 PM
> To: Kashyap Desai
> Cc: Martin K. Petersen; linux-scsi@xxxxxxxxxxxxxxx; Sumit Saxena;
> Shivasharan Srikanteshwara
> Subject: Re: MegaCli fails to communicate with Raid-Controller
>
> On 23. Apr 2018, at 11:03, Volker Schwicking
> <volker.schwicking@xxxxxxxxxxx> wrote:
> >
> > I will add the printk to dma_alloc_coherent() as well to see which
> > request actually fails. But I have to be a bit patient since it's a
> > production system and the customers aren't too happy about reboots.
>
> Alright, here are some results.
>
> Looking at my debug lines I can tell that requests for either 2048 or 4
> regularly fail. Other values never show up as failed, but there are
> several, as you can see in the attached log.
>
> The failed requests:
> ###
> $ grep 'GD IOV-len FAILED' /var/log/kern.log | awk '{ print $9, $10 }' | sort | uniq -c
>      59 FAILED: 2048
>      64 FAILED: 4
> ###

Thanks! This helps to understand the problem. A few questions:

How often does this failure occur? Can you reproduce it on demand?
Do you see no failures at all on the 4.6 kernel?
What does your setup look like? Are you running in a VM, or does this
failure occur on the host OS? Can you share the full dmesg logs?

>
> I attached full debugging output from several executions of
> “megacli -ldpdinfo -a0” in 5 second intervals, successful and failed,
> and content from /proc/buddyinfo again.
>
> Can you make any sense of that? Where should I go from here?

It may be better to find the call trace of dma_alloc_coherent() using
ftrace. Depending on which DMA engine is configured, the failure may be
related to code changes in that DMA engine.
Can you get those ftrace logs as well? You may have to cherry-pick an
ftrace filter around dma_alloc_coherent().
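
Something along these lines might work as a starting point. It is an
untested sketch, not a recipe: the tracefs mount point, the helper names
start_alloc_trace/stop_alloc_trace and the '*alloc_coherent*' pattern are
all assumptions. dma_alloc_coherent() itself may be inlined and therefore
missing from available_filter_functions, so the glob is meant to catch
whatever allocator sits behind it on your configuration.

```shell
#!/bin/sh
# Sketch only. Assumptions: tracefs is mounted at the path below, and the
# pattern passed in matches the allocators behind dma_alloc_coherent().
TRACEFS="${TRACEFS:-/sys/kernel/debug/tracing}"

# Hypothetical helper: arm the function tracer with a stack trace per hit.
# $1 is a set_ftrace_filter pattern, e.g. '*alloc_coherent*'.
start_alloc_trace() {
    echo 0 > "$TRACEFS/tracing_on"                # pause recording while configuring
    echo function > "$TRACEFS/current_tracer"
    echo "$1" > "$TRACEFS/set_ftrace_filter"      # restrict tracing first ...
    echo 1 > "$TRACEFS/options/func_stack_trace"  # ... then record call stacks
    echo 1 > "$TRACEFS/tracing_on"
}

# Hypothetical helper: stop tracing and print what was captured.
stop_alloc_trace() {
    echo 0 > "$TRACEFS/tracing_on"
    echo 0 > "$TRACEFS/options/func_stack_trace"
    cat "$TRACEFS/trace"
}
```

As root, you would check the candidates first with
"grep alloc_coherent $TRACEFS/available_filter_functions", then call
start_alloc_trace '*alloc_coherent*', run "megacli -ldpdinfo -a0" until it
fails, and save the output of stop_alloc_trace. The filter is set before
func_stack_trace is enabled so that stacks are only recorded for the
matching functions.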

I quickly grepped arch/xen for anything related to memory allocation and
found that pci_xen_swiotlb_detect() has some methods to enable/disable
certain features, and one of the key factors is whether the DMA range is
32-bit or 64-bit. Since the older controller requests its DMA buffers
below the 4GB region, some code change in that area between 4.6 and
4.14.x might be a possible reason for the frequent memory allocation
failures. This is my wild guess, based on the info that 4.6 is
*not at all* exposed to memory allocation failures at the same frequency
as 4.14.

Kashyap