I ran into this with 4.9.32 when I rebooted the target. I tested 4.12-rc6 and
this particular error seems to have been resolved, but I now get a new one on
the initiator. This one doesn't seem as impactful.

[Mon Jun 19 11:17:20 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:20 2017] 00000000 93005204 0a0001bd 45c8e0d2
[Mon Jun 19 11:17:20 2017] iser: iser_err_comp: command failure: local protection error (4) vend_err 52
[Mon Jun 19 11:17:20 2017] connection3:0: detected conn error (1011)
[Mon Jun 19 11:17:31 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:31 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 93005204 0a0001e7 45dd82d2
[Mon Jun 19 11:17:31 2017] iser: iser_err_comp: command failure: local protection error (4) vend_err 52
[Mon Jun 19 11:17:31 2017] connection4:0: detected conn error (1011)
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 93005204 0a0001f4 004915d2
[Mon Jun 19 11:17:31 2017] iser: iser_err_comp: command failure: local protection error (4) vend_err 52
[Mon Jun 19 11:17:31 2017] connection3:0: detected conn error (1011)
[Mon Jun 19 11:17:44 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:44 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:44 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:44 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:44 2017] 00000000 93005204 0a0001f6 004519d2
[Mon Jun 19 11:17:44 2017] iser: iser_err_comp: command failure: local protection error (4) vend_err 52
[Mon Jun 19 11:17:44 2017] connection3:0: detected conn error (1011)
[Mon Jun 19 11:18:55 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:18:55 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:18:55 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:18:55 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:18:55 2017] 00000000 93005204 0a0001f7 01934fd2
[Mon Jun 19 11:18:55 2017] iser: iser_err_comp: command failure: local protection error (4) vend_err 52
[Mon Jun 19 11:18:55 2017] connection3:0: detected conn error (1011)
[Mon Jun 19 11:20:25 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:20:25 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:20:25 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:20:25 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:20:25 2017] 00000000 93005204 0a0001f8 0274edd2
[Mon Jun 19 11:20:25 2017] iser: iser_err_comp: command failure: local protection error (4) vend_err 52
[Mon Jun 19 11:20:25 2017] connection3:0: detected conn error (1011)

I'm going to try to cherry-pick the fix to 4.9.x and do some testing there.

Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Thu, May 18, 2017 at 7:34 AM, Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> On Wed, May 17, 2017 at 02:56:36PM +0200, Marta Rybczynska wrote:
>> > On Mon, May 15, 2017 at 07:59:52AM -0700, Christoph Hellwig wrote:
>> >> On Mon, May 15, 2017 at 05:36:32PM +0300, Leon Romanovsky wrote:
>> >> > I understand you and both Max and me are feeling the same as you. For more
>> >> > than 2 months, we constantly (almost on daily basis) asked for a solution from
>> >> > architecture group, but received different answers.
>> >> > The proposals were extremely broad from need for strong fence for all
>> >> > cards to no need for strong fence at all.
>> >>
>> >> So let's get the patch to do a strong fence everywhere now, and relax
>> >> it later where possible.
>> >>
>> >> Correctness before speed.
>> >
>> > OK, please give me and Max till EOW to stop this saga. One of the two
>> > options will be: Max will resend original patch, or Max will send patch
>> > blessed by architecture group.
>> >
>>
>> Good luck with this Max & Leon! It seems to be a complicated problem.
>> Just an idea: in our case it *seems* that the problem started appearing
>> after a firmware upgrade, older ones do not seem to have the same
>> behaviour. Maybe it's a hint for you.
>
> OK, we came to the agreement which capability bits we should add. Max
> will return to the office at the middle of the next week and we will
> proceed with the submission of proper patch once our shared code will
> be accepted.
>
> In the meantime, I put the original patch to be part of our regression.
> https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/commit/?h=testing/queue-next&id=a40ac569f243db552661e6efad70080bb406823c
>
> Thank you for your patience.
>
>>
>> Thanks!
>> Marta
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
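
For anyone trying to read the error CQE dumps in the log at the top of this message: the last row of each dump appears to carry the fields that iser then reports as "local protection error (4) vend_err 52". Below is a minimal Python sketch that pulls those fields out of one row. The byte offsets are an assumption taken from my reading of struct mlx5_err_cqe in the kernel headers, and decode_err_cqe_tail is a hypothetical helper name, not anything from the driver. Please check it against your own kernel tree before relying on it.

```python
import struct

# Only the syndrome value that appears in this thread is named here;
# any other value is reported as "unknown".
SYNDROME_NAMES = {0x04: "local protection error"}


def decode_err_cqe_tail(row):
    """Decode the last row (bytes 48..63) of a 64-byte error CQE dump.

    `row` is the text of the row without the timestamp, for example
    "00000000 93005204 0a0001bd 45c8e0d2". The offsets used below are
    assumed from struct mlx5_err_cqe and are not confirmed by this thread.
    """
    words = [int(w, 16) for w in row.split()]
    if len(words) != 4:
        raise ValueError("expected four 32-bit hex words")
    raw = struct.pack(">4I", *words)  # the dump prints big-endian words

    vendor_err = raw[6]   # assumed offset 54 in the full CQE
    syndrome = raw[7]     # assumed offset 55 in the full CQE
    opcode_qpn = struct.unpack(">I", raw[8:12])[0]
    wqe_counter = struct.unpack(">H", raw[12:14])[0]

    return {
        "vendor_err": vendor_err,
        "syndrome": syndrome,
        "syndrome_name": SYNDROME_NAMES.get(syndrome, "unknown"),
        "wqe_opcode": opcode_qpn >> 24,    # top byte of the combined field
        "qpn": opcode_qpn & 0xFFFFFF,      # low 24 bits of the combined field
        "wqe_counter": wqe_counter,
        "cqe_opcode": raw[15] >> 4,        # high nibble of the last byte
    }


if __name__ == "__main__":
    fields = decode_err_cqe_tail("00000000 93005204 0a0001bd 45c8e0d2")
    for name, value in fields.items():
        if isinstance(value, int):
            print(f"{name}: 0x{value:x}")
        else:
            print(f"{name}: {value}")
```

Run on the first dump from the log, this gives vendor_err 0x52 and syndrome 0x4, which agrees with the iser message that follows it. The QP number and WQE counter differ from dump to dump, so they may help tell the failing work requests apart.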