> snip
>
> > > Just FYI, I've previously encountered some stability issues with
> > > ConnectX-3's in ethernet mode using older versions of firmware.. On my
> > > current setup 2.30.8000 <-> 2.30.8000 has been stable in ethernet mode
> > > for some time, but it probably couldn't hurt to use matching FW versions
> > > on both sides..
> > >
> > > Also, it's been reported offlist that running large MTUs with certain
> > > (non Mellanox) switches can result in various timeouts + instability.
> > > It would be worthwhile to verify those settings on both sides as well.
> > >
> > > Mellanox folks..? Any other ideas to help debug this..?
> > >
> > > --nab
> >
> > Looks like the issue was HardwareAcceleration in esx... Essentially we
> > would get the timeouts when trying to create a VM Thick Provisioned
> > Eager Zero which translated into esx sending a WRITE_SAME command.
> > Jared was doing a wireshark capture when he noticed that.
> >
> > This reminded me that we had to disable the HardwareAcceleration in
> > esx when we were doing VMMark last year. By default, these values are
> > enabled and LIO seems to advertise that it supports hardware
> > acceleration based on the datastore characteristics in ESX.
> >
> > VMFS3.HardwareAcceleratedLocking
> > DataMover.HardwareAcceleratedMove
> > DataMover.HardwareAcceleratedInit
> >
> > As soon as we disabled them, no more time out issues...
>
> Ah yes, thanks for confirming.
>
> > I believe WRITE_SAME/XCOPY and ATS only made it into LIO in 3.14?
>
> So WRITE_SAME support for IBLOCK went in v3.6, along with generic
> EXTENDED_COPY + COMPARE_AND_WRITE support in v3.12.
>
> > The question I have is where does this information belong and how can
> > one debug these issues...
>
> FYI, these VAAI primitives can also be disabled target side with device
> attributes:
>
> emulate_caw=0
> emulate_3pc=0
> max_write_same_len=0
>
> To debug, please try ESX host settings Init=0 + Move=1 + Locking=1 to
> see if it's specific to WRITE_SAME, and separately if COMPARE_AND_WRITE
> traffic can also trigger the bug..

Setting Init=0, Move=1 and Locking=1 does not trigger the timeout issue.
So far the issue seems to be specific to WRITE_SAME. I will

> Also, what do the negotiated ImmediateData + InitialR2T parameter
> settings look like..?

Both are set to yes. I am reading these off of
/sys/kernel/config/.../iqn..../param/

> Thanks Moussa!
>
> --nab
>
> PS: Also grab the EXTENDED_COPY memory leak bugfix from Mikulas:
>
> https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?id=1e1110c43b1cda9fe77fc4a04835e460550e6b3c
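
For anyone who wants to script the target-side workaround quoted above, here is a minimal sketch. It assumes the three device attributes are exposed as writable files in the backstore's configfs attrib directory (typically something like /sys/kernel/config/target/core/<hba>/<device>/attrib/, but that location is an assumption and not taken from this thread). The helper name disable_vaai and the VAAI_OFF table are hypothetical, made up for illustration. Needs root on a real target.

```python
from pathlib import Path

# Attribute values quoted in the thread. Writing them is expected to turn
# off the matching VAAI primitive on the target side.
VAAI_OFF = {
    "emulate_caw": "0",         # COMPARE_AND_WRITE (ATS)
    "emulate_3pc": "0",         # EXTENDED_COPY (XCOPY)
    "max_write_same_len": "0",  # WRITE_SAME
}


def disable_vaai(attrib_dir, settings=VAAI_OFF):
    """Write each setting into attrib_dir and return the values read back.

    attrib_dir is the device's attrib directory; its location is an
    assumption and has to be supplied by the caller.
    """
    attrib_dir = Path(attrib_dir)
    result = {}
    for name, value in settings.items():
        path = attrib_dir / name
        path.write_text(value + "\n")
        # Read the value back so the caller can see what actually stuck.
        result[name] = path.read_text().strip()
    return result
```

Calling disable_vaai() with the attrib directory of the affected backstore returns a dict of the values read back, so a mismatch shows up right away. To isolate WRITE_SAME only, pass settings={"max_write_same_len": "0"} and leave the other two attributes alone.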