Re: Upstream mlx4 driver very broken (when using SRIOV)

On Sat, Jun 13, 2015 at 8:35 AM, Doug Ledford <dledford@xxxxxxxxxx> wrote:
> I ran across a problem today when I went to do some run tests of my
> for-4.2 tree.  For a second there, I was about to seriously have a
> conniption fit.  But, after about 6 hours of work bisecting and
> debugging, I've come to find that I wasn't so crazy after all.
>
> When I went to install my for-4.2 tree, IPoIB was totally busted, as in
> DOA.  I knew the 4.1 code I submitted to Linus I had checked, but I
> wanted to have a good starting point for a bisection so I compiled a
> kernel from my for-4.1-rc branch.  And it was DOA too.  That seriously
> unnerved me because I knew I tested that code.  I did a number of manual
> checkouts at possible suspicious code points, and none of them showed
> that the problem was resolved.  Then I started doing some debugging on
> both the afflicted machine and on the opensm server.  I finally saw that
> the afflicted machine was claiming that it was attempting to join the
> multicast group, but was reporting error 110 (ETIMEDOUT).  The opensm
> server was not seeing the requests at all.
>
> Long story short, I did my testing in the 4.1 merge window and rc phase
> on machines without SRIOV enabled, but when you enable SRIOV in the mlx4
> driver, the current driver seems to have broken QP0/QP1 multiplexing
> support because the host becomes unable to join the IPoIB multicast
> groups.  In addition, with SRIOV enabled, mlx4_en throws corruption
> errors on reboot and requires that the machine be power cycled as
> opposed to rebooting cleanly.  From what I can tell, the 4.0 release
> kernel has this problem too, and it still exists at least as far as
> 4.1-rc7 + all of my queued up -next patches.
>
> From my /etc/modprobe.d/mlx4.conf file if you want to try and duplicate:
>
> options mlx4_core probe_vf=0 num_vfs=7 port_type_array=1,2

Doug,

You were 100% right. Due to a recent FW bug, SRIOV QP0/QP1 PV is broken
with the VPI config of IB/Eth (port_type_array=1,2). Personally, I didn't
hit it, since I moved my working environment to Eth/IB (2,1) a
couple of weeks ago. Oh well.
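
For anyone who wants to check whether a host is in the affected
configuration, here is a rough sketch in Python. It is not part of the
driver or of the fix; the function names are made up for illustration,
and it assumes the single-integer form of num_vfs and the 1=IB, 2=Eth
port type values used in the options line quoted above.

```python
def parse_mlx4_core_options(line):
    """Parse an 'options mlx4_core key=value ...' line into a dict.

    Returns None if the line is not an mlx4_core options line.
    (Hypothetical helper, written only for this illustration.)
    """
    parts = line.split()
    if len(parts) < 2 or parts[0] != "options" or parts[1] != "mlx4_core":
        return None
    opts = {}
    for token in parts[2:]:
        key, sep, value = token.partition("=")
        if sep:  # keep only well-formed key=value tokens
            opts[key] = value
    return opts


def is_affected(opts):
    """True when SRIOV is enabled and the ports are IB/Eth (1,2).

    Assumption: num_vfs is a single integer, as in the quoted config.
    Any other form is treated as "not affected" rather than guessed at.
    """
    if not opts:
        return False
    try:
        num_vfs = int(opts.get("num_vfs", "0"))
    except ValueError:
        return False
    return num_vfs > 0 and opts.get("port_type_array") == "1,2"


if __name__ == "__main__":
    line = "options mlx4_core probe_vf=0 num_vfs=7 port_type_array=1,2"
    print(is_affected(parse_mlx4_core_options(line)))
```

With port_type_array=2,1 (Eth/IB) the same check returns False, which
matches why I did not see the problem on my setup.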

The fix is easy: disable Granular VF QoS in that VPI config. I tested
it and have just sent it to net [1].

We should check why the upstream regression environment didn't
catch this.

Or.

[1] http://patchwork.ozlabs.org/patch/483991/

> options mlx4_en pfctx=0x28 pfcrx=0x28
>
> And I'm guessing that your internal regression tests must not have a
> machine in IB/Eth SRIOV mode as a standard config.  I would consider
> adding it to the mix.  I have it myself, but only on a few machines and
> I don't always use them for initial testing.


