Re: [PATCH 1/2] net/mlx5: increase async EQ to avoid EQ overrun

Jason Gunthorpe <jgg@xxxxxxxxxxxx> · Mon, 5 Feb 2018 16:16:17 -0700

On Tue, Feb 06, 2018 at 01:11:41AM +0200, Max Gurtovoy wrote:
> 
> 
> On 2/5/2018 8:09 PM, Jason Gunthorpe wrote:
> >On Mon, Feb 05, 2018 at 04:29:51PM +0200, Max Gurtovoy wrote:
> >>Currently the async EQ has only 256 entries. It might not be big enough
> >>for the SW to handle all the needed pending events. For example, in case
> >>of many QPs (let's say 1024) connected to an SRQ created using NVMeOF target
> >>and the target goes down, the FW will raise 1024 "last WQE reached" events
> >>and may cause EQ overrun. Increase the EQ to a more reasonable size; beyond
> >>that, the FW should be able to delay the event and raise it later on using
> >>an internal backpressure mechanism.
> >
> >If the firmware has an internal backpressure mechanism then why
> >would we get an EQ overrun?
> 
> FW backpressure mechanism is WIP, that's why we get the overrun.

Ah, so current HW blows up if EQ is overrun and that can actually be
triggered by ULPs? Yuk

> After consulting with the FW team, we concluded that an EQ depth of 256 is
> too small. Do you think it's reasonable to allocate 4k entries (256KB of
> contig memory) for the async EQ?

No idea, ask Saeed?
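
For reference, a rough sketch of the arithmetic behind the numbers quoted
above. The 64-byte entry size is an assumption derived from the figures in
this thread (256KB / 4k entries), and the helper names are hypothetical,
not taken from the driver:

```c
/* Assumption: 64 bytes per EQ entry, i.e. 256KB / 4096 entries as quoted. */
#define EQE_SIZE_BYTES 64u

/* Contiguous memory needed for an EQ with nent entries. */
static unsigned int eq_bytes(unsigned int nent)
{
	return nent * EQE_SIZE_BYTES;
}

/*
 * Events that do not fit if a burst arrives before the consumer has
 * drained anything (worst case, no FW backpressure).
 */
static unsigned int eq_overflow(unsigned int nent, unsigned int burst)
{
	return burst > nent ? burst - nent : 0;
}
```

Under that assumption the current 256-entry EQ takes 16KB and leaves 768 of
the 1024 "last WQE reached" events with nowhere to go, while 4096 entries
take 256KB and absorb the whole burst.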

> >Do we need to block adding too many QPs to a SRQ as well or something
> >like that?
> 
> Hard to say. In the storage world, this may lead to a situation where
> initiator X has priority over initiator Y without any good reason (only
> because X was served before Y).

Well, correctness comes first. If the device has to protect itself
from rogue ULPs, and that means enforcing a goofy limit, then so be
it :(

Presumably someday fixed firmware will remove the limitation?

Jason