Re: 3.12.5 Target Errors

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Sun, 18 May 2014 00:23:40 -0700

On Sat, 2014-05-17 at 07:44 +0000, Moussa Ba (moussaba) wrote:
> > -----Original Message-----
> > From: Nicholas A. Bellinger [mailto:nab@xxxxxxxxxxxxxxx]
> > Sent: Friday, May 16, 2014 2:14 PM
> > To: Moussa Ba (moussaba)
> > Cc: Sagi Grimberg; target-devel@xxxxxxxxxxxxxxx; Nicholas Bellinger; Or
> > Gerlitz; Jared Hulbert (jehulber); Yaron Haviv; roid@xxxxxxxxxxxx; Oren
> > Duer
> > Subject: Re: 3.12.5 Target Errors
> > 
> > On Fri, 2014-05-16 at 05:54 +0000, Moussa Ba (moussaba) wrote:
> > > I am able to connect to the target without issues using a CentOS
> > > initiator. It logs in quickly and I can run read/write fio without
> > > issues on the same target. Trying to do that from ESX 5.5, though,
> > > results in continuous connection drops... Is there something special
> > > about the ESX initiator? I am running out of ideas. I see similar
> > > issues with tgt, where it completely fails to log in.
> > >
> > > I am running out of ideas...Any suggestion is welcome.
> > >
> > >
> > > Target:
> > > FW: 2.30.8000
> > > Kernel:3.12.9+patches
> > > ConnectX-3 cards are configured as Ethernet cards.
> > >
> > >  Initiator:
> > >  FW 2.31.5050 (originally 2.11.500; I upgraded it but saw no
> > >  difference, still the same error)
> > >  Driver: 1.9.10.0-1OEM.550.0.0.1331820
> > >  Using iser mode
> > >
> > 
> > Just FYI, I've previously encountered some stability issues with
> > ConnectX-3's in ethernet mode using older versions of firmware. On my
> > current setup 2.30.8000 <-> 2.30.8000 has been stable in ethernet mode
> > for some time, but it probably couldn't hurt to use matching FW
> > versions on both sides.
> > 
> > Also, it's been reported offlist that running large MTUs with certain
> > (non Mellanox) switches can result in various timeouts + instability.
> > It would be worthwhile to verify those settings on both sides as well.
> > 
> > Mellanox folks..?  Any other ideas to help debug this..?
> > 
> > --nab
> 
> 
> Looks like the issue was HardwareAcceleration in ESX. Essentially, we
> would get the timeouts when trying to create a Thick Provision Eager
> Zeroed VM, which translated into ESX sending a WRITE_SAME command.
> Jared noticed that while doing a Wireshark capture.
>
> This reminded me that we had to disable HardwareAcceleration in ESX
> when we were doing VMMark last year. By default, the settings below
> are enabled, and judging by the datastore characteristics shown in
> ESX, LIO seems to advertise that it supports hardware acceleration.
> 
> 
> VMFS3.HardwareAcceleratedLocking
> DataMover.HardwareAcceleratedMove
> DataMover.HardwareAcceleratedInit
> 
> As soon as we disabled them, no more timeout issues...

Ah yes, thanks for confirming.

> I believe WRITE_SAME/XCOPY and ATS only made it into LIO in 3.14?

So WRITE_SAME support for IBLOCK went in v3.6, and generic
EXTENDED_COPY + COMPARE_AND_WRITE support followed in v3.12, so all
three are already present in your 3.12.9 target kernel.

> The question I have is where does this information belong and how can
> one debug these issues...
> 

FYI, these VAAI primitives can also be disabled target side with device
attributes:

  emulate_caw=0
  emulate_3pc=0
  max_write_same_len=0
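
A minimal sketch of applying those three attributes from a script,
assuming the usual configfs layout where each backend device exposes an
attrib/ directory (typically /sys/kernel/config/target/core/<hba>/<dev>/attrib).
The helper name and the example path in the comment are hypothetical;
only the attribute names and values come from the list above.

```python
from pathlib import Path

# Attributes from the list above, with the value that disables each primitive.
VAAI_OFF = {
    "emulate_caw": "0",         # COMPARE_AND_WRITE (ATS)
    "emulate_3pc": "0",         # EXTENDED_COPY (XCOPY)
    "max_write_same_len": "0",  # WRITE_SAME
}


def disable_vaai(attrib_dir):
    """Write the disabling value to each attribute file.

    Returns the previous values so they can be restored afterwards.
    """
    attrib_dir = Path(attrib_dir)
    previous = {}
    for name, value in VAAI_OFF.items():
        path = attrib_dir / name
        previous[name] = path.read_text().strip()
        path.write_text(value + "\n")
    return previous


# Example call (placeholder HBA and device names, run as root on the target):
# disable_vaai("/sys/kernel/config/target/core/iblock_0/my_dev/attrib")
```

Keeping the old values makes it easy to re-enable one primitive at a
time while narrowing down which command triggers the timeouts.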

To debug, please try ESX host settings Init=0 + Move=1 + Locking=1
(i.e. only DataMover.HardwareAcceleratedInit disabled) to see if it's
specific to WRITE_SAME, and separately check whether COMPARE_AND_WRITE
traffic can also trigger the bug.
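
The isolation runs could be scripted roughly as follows. This is only a
sketch: it prints command lines rather than running them, and it assumes
the three advanced settings quoted earlier are addressed through
`esxcli system settings advanced set` with option paths of the form
/DataMover/HardwareAcceleratedInit, which should be verified on the ESX
host. The second run (only Locking left enabled) is an illustrative
guess at how to isolate COMPARE_AND_WRITE, not something specified above.

```python
# Shorthand used above -> ESX advanced option path (assumed esxcli form).
OPTIONS = {
    "Init": "/DataMover/HardwareAcceleratedInit",    # WRITE_SAME
    "Move": "/DataMover/HardwareAcceleratedMove",    # EXTENDED_COPY
    "Locking": "/VMFS3/HardwareAcceleratedLocking",  # COMPARE_AND_WRITE
}


def esxcli_commands(settings):
    """Return one esxcli command line per entry, e.g. {'Init': 0}."""
    cmds = []
    for short_name, value in settings.items():
        if value not in (0, 1):
            raise ValueError("%s must be 0 or 1" % short_name)
        cmds.append(
            "esxcli system settings advanced set --int-value %d --option %s"
            % (value, OPTIONS[short_name])
        )
    return cmds


RUNS = [
    {"Init": 0, "Move": 1, "Locking": 1},  # from the message: WRITE_SAME off
    {"Init": 0, "Move": 0, "Locking": 1},  # assumption: only ATS left on
]

for run in RUNS:
    print("\n".join(esxcli_commands(run)))
    print()
```

Repeating the eager-zeroed VM creation after each run shows which
primitive has to be active for the timeouts to come back.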

Also, what do the negotiated ImmediateData + InitialR2T parameter
settings look like..?

Thanks Moussa!

--nab

PS: Also grab the EXTENDED_COPY memory leak bugfix from Mikulas:

https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?id=1e1110c43b1cda9fe77fc4a04835e460550e6b3c
