RE: 3.12.5 Target Errors

"Moussa Ba (moussaba)" <moussaba@xxxxxxxxxx> · Sat, 17 May 2014 07:44:37 +0000

> -----Original Message-----
> From: Nicholas A. Bellinger [mailto:nab@xxxxxxxxxxxxxxx]
> Sent: Friday, May 16, 2014 2:14 PM
> To: Moussa Ba (moussaba)
> Cc: Sagi Grimberg; target-devel@xxxxxxxxxxxxxxx; Nicholas Bellinger; Or
> Gerlitz; Jared Hulbert (jehulber); Yaron Haviv; roid@xxxxxxxxxxxx; Oren
> Duer
> Subject: Re: 3.12.5 Target Errors
> 
> On Fri, 2014-05-16 at 05:54 +0000, Moussa Ba (moussaba) wrote:
> > I am able to connect to the target without issues using a CentOS
> > initiator. It logs in fast and I can run read/write fio without issues
> > on the same target. Trying to do that from ESX 5.5, though, results in
> > continuous connection drops. Is there something special about the ESX
> > initiator? I am running out of ideas. I see similar issues with tgt,
> > where it completely fails to log in.
> >
> > I am running out of ideas...Any suggestion is welcome.
> >
> >
> > Target:
> > FW: 2.30.8000
> > Kernel:3.12.9+patches
> > ConnectX-3 cards are configured as Ethernet cards.
> >
> >  Initiator:
> >  FW 2.31.5050 (it was originally 2.11.500; I upgraded it but saw no
> >  difference, still seeing the same error)
> >  Driver: 1.9.10.0-1OEM.550.0.0.1331820
> >  Using iser mode
> >
> 
> Just FYI, I've previously encountered some stability issues with
> ConnectX-3's in ethernet mode using older versions of firmware..  On my
> current setup 2.30.8000 <-> 2.30.8000 has been stable in ethernet mode
> for some time, but it probably couldn't hurt to use matching FW
> versions on both sides..
> 
> Also, it's been reported offlist that running large MTUs with certain
> (non Mellanox) switches can result in various timeouts + instability.
> It would be worthwhile to verify those settings on both sides as well.
> 
> Mellanox folks..?  Any other ideas to help debug this..?
> 
> --nab

It looks like the issue was hardware acceleration in ESX. We would get the timeouts when trying to create a VM with a Thick Provisioned Eager Zeroed disk, which translates into ESX sending a WRITE_SAME command. Jared noticed this while doing a Wireshark capture. This reminded me that we had to disable hardware acceleration in ESX when we were running VMmark last year. By default the settings below are enabled and, going by the datastore characteristics shown in ESX, LIO seems to advertise that it supports hardware acceleration.

VMFS3.HardwareAcceleratedLocking
DataMover.HardwareAcceleratedMove
DataMover.HardwareAcceleratedInit
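
For reference when reading a capture: each of these settings gates one offload primitive, and each primitive shows up on the wire as a particular SCSI command. Below is a minimal Python sketch of that mapping. The opcode values are the standard SCSI ones; the table and helper names (VAAI_SETTINGS, settings_for_opcode) are made up for illustration and are not part of ESX or LIO.

```python
# Illustrative lookup table: ESX advanced setting -> SCSI commands the
# initiator may send when that setting is enabled.
# VAAI_SETTINGS and settings_for_opcode are hypothetical names.
VAAI_SETTINGS = {
    "VMFS3.HardwareAcceleratedLocking": {0x89: "COMPARE AND WRITE"},  # ATS
    "DataMover.HardwareAcceleratedMove": {0x83: "EXTENDED COPY"},     # XCOPY
    "DataMover.HardwareAcceleratedInit": {
        0x41: "WRITE SAME(10)",
        0x93: "WRITE SAME(16)",
    },
}


def settings_for_opcode(opcode):
    """Return (setting, command name) pairs for a SCSI opcode seen in a capture."""
    return [
        (setting, commands[opcode])
        for setting, commands in VAAI_SETTINGS.items()
        if opcode in commands
    ]


if __name__ == "__main__":
    # 0x93 is an offload command; 0x2A (WRITE(10)) is ordinary I/O.
    for op in (0x93, 0x2A):
        print(hex(op), settings_for_opcode(op) or "not an offload command")
```

If the last command before a connection drop maps to one of these settings, that setting is the first candidate to disable.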

As soon as we disabled them, the timeouts went away. I believe WRITE_SAME/XCOPY and ATS only made it into LIO in 3.14? My question is where this information belongs, and how one can debug issues like this.
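
For anyone hitting the same thing, here is a rough sketch of how the three settings could be turned off from the ESXi shell. The esxcli syntax is an assumption based on VMware's documented procedure for disabling hardware acceleration and should be checked against your ESX version. The helper name is hypothetical, and the script only prints the commands rather than running them.

```python
# The three advanced settings named in this thread.
SETTINGS = [
    "VMFS3.HardwareAcceleratedLocking",
    "DataMover.HardwareAcceleratedMove",
    "DataMover.HardwareAcceleratedInit",
]


def esxcli_set_command(setting, value=0):
    """Build the esxcli command line for one advanced setting (0 = disabled)."""
    # Assumption: esxcli addresses advanced options as a path,
    # so "DataMover.X" becomes "/DataMover/X".
    option = "/" + setting.replace(".", "/")
    return (
        "esxcli system settings advanced set "
        f"--int-value {value} --option {option}"
    )


if __name__ == "__main__":
    for s in SETTINGS:
        print(esxcli_set_command(s))
```

Passing value=1 builds the commands that turn the settings back on, which is handy for confirming that the timeouts return.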

P.S.: I am attaching the Wireshark capture.

Moussa
Attachment: wireshark-writesame.cap.gz