Re: Lower than expected iSCSI performance compared to CIFS

On Sun, 2013-08-25 at 22:26 -0600, Scott Hallowell wrote:
> Nicholas, Jörn
> 

Hi Scott,

Btw, for future reference please do not top-post your responses, as it
makes the thread more difficult to read.

> Thanks for your response on this.  I apologize for the delay getting
> back to the list with my results.
> 
> I am running iblock, in my specific case.  While I see better results
> with fileio, I'd like to stick with iblock if possible.

To confirm, when you enable buffered FILEIO, you're able to reach
results comparable to Samba, right?

If you're able to switch backends and reach 1 Gb/sec performance, that
would tend to indicate that it's something specific to the backend, and
not an iSCSI fabric-specific issue.

> The NAS I am comparing against, which is performing surprisingly well,
> is also set up to use iblock. 

Please share which NAS you're comparing against, and the version of LIO
that it's running.  (eg: cat /sys/kernel/config/target/version)

>  I had default_cmdsn_depth still at 16 (as did the
> NAS I am comparing against), but I did not see an improvement in my
> robocopy results in Windows 7 when I increased the number of
> outstanding commands to 64 or 128.  That may be because the data
> transfer is using larger blocks (I believe it is configured for
> ~250K), requiring fewer commands to transfer a given amount of
> data.  Additionally, I already had emulate_write_cache turned on.
> 

This would also tend to indicate something specific to the backend.

> I did try with TCP_NODELAY enabled on the Windows side.  The change in
> performance was insignificant (maybe a few percent faster, but easily
> within the noise of my measurement).
> 

Thanks for verifying that bit.

> I spent a little time trying to characterize the difference between my
> system and the comparison NAS.  One thing I looked at, which may or
> may not indicate anything, was the Queue Depth shown in the info for
> the LUN under configFS on both my system and the comparison system.  I
> put together a little script to cat the contents of the info configFS
> entry every second or so while the copy was running.  On my system,
> I'd see the size of the "left" queue depth drop to around 103 at its
> lowest point when writing from a Windows 7 system.  It gets down to 81
> when running from a Windows 2008 system.  I have a Max of 128 entries
> listed.  On the comparison NAS system, the Max number of queue entries
> is 32, and the "left" queue depth never goes below 31.
> 
> While there is probably little that can be explicitly derived from
> this, it would suggest to me that the comparison NAS system is able to
> process the writes coming in from the iSCSI target interface much
> faster than my system does, as the number of queue entries does not
> seem to stack up.

Since you've already eliminated a different default_cmdsn_depth value,
it's likely not going to be an iscsi-target issue.  It's most likely an
issue of one of the software RAID configurations being faster for
non-buffered IO.
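
One quick way to take iscsi-target out of the picture entirely is to
measure non-buffered (O_DIRECT) write throughput against the RAID5 array
directly, on your box and on the NAS if you have shell access there.  A
rough sketch, assuming the array (or a scratch partition / LV on it) is
mounted at /mnt/md0 - adjust the path, and only point this at scratch
space, since it writes ~1GB in ~250K blocks to roughly match your iSCSI
transfer size:

# dd if=/dev/zero of=/mnt/md0/ddtest bs=256k count=4096 oflag=direct conv=fsync
# rm /mnt/md0/ddtest

If your array comes out noticeably slower here, the problem is below the
target, not in it.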

>  Both the NAS and my system are using
> software RAID5 arrays, so I am now wondering if there is some
> interaction between my iSCSI setup and my mdraid setup that does not
> exist on the other NAS.  If incoming write requests over iSCSI were
> being mapped to more physical disk accesses in my system, compared to
> the other NAS, I could certainly see this as a cause for the lower
> results.  Does anyone have any thoughts on this?
> 

It depends on a number of things.  One is the physical queue depth for
each of the drives in the software raid.  Typical low-end HBAs only
support queue_depth=1, which certainly has an effect on performance.
This value is located at /sys/class/scsi_device/$HCTL/device/queue_depth.
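
For example, to dump the current depth for every SCSI device on the
system in one shot (the H:C:T:L tuples will obviously differ per system):

# for d in /sys/class/scsi_device/*; do echo "$(basename $d): $(cat $d/device/queue_depth)"; done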

Another factor can be whether the individual drives in the raid have the
underlying WriteCacheEnable bit set.  This can be checked with 'sdparm
--get=WCE /dev/sdX', and set with 'sdparm --set=WCE /dev/sdX'.
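
To check all four members of the md array in one go, something along
these lines works (assuming the array is /dev/md0 - substitute the real
array name, and sdparm needs to be installed):

# mdadm --detail /dev/md0 | awk '/\/dev\/sd/ {print $NF}' | \
      while read dev; do echo "== $dev"; sdparm --get=WCE "$dev"; done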

Also, you'll want to understand the implications of enabling this, namely
that in the case of a power failure there is no assurance that the data
in an individual drive's cache has been written out to disk.

--nab

> Thanks,
> 
> Scott
> 
> On Mon, Aug 19, 2013 at 1:36 PM, Nicholas A. Bellinger
> <nab@xxxxxxxxxxxxxxx> wrote:
> > Hi Scott,
> >
> > On Sun, 2013-08-18 at 22:43 -0600, Scott Hallowell wrote:
> >> I have been looking into a performance concern with the iSCSI target
> >> as compared to CIFS running on the same server.  The expectation was
> >> that iSCSI should perform somewhat similarly to Samba.  The test
> >> environments are Windows 7 & Windows 2008 initiators connecting to a
> >> target running on a Debian wheezy release (a 3.2.46 kernel).  The test
> >> is a file copy from Windows to the Linux server.  The source volume is
> >> a software RAID 0 running on Windows.  The destination is an iSCSI LUN
> >> on a software RAID 5 array with 4 disks (2TB WD Reds).
> >>
> >> The write from Windows to the iSCSI volume is considerably slower
> >> (about half the rate) than the CIFS write.
> >>
> >> Specifically, I see 90+ MB/s writes with Samba on both the Windows 7
> >> and Windows 2008 machines (using robocopy and 5.7GB of data spread
> >> unevenly across about 30 files).
> >>
> >> Performing the same tests with iSCSI and what I believe to be the
> >> 2.0.8 version of the Windows iSCSI initiator, I am getting closer to
> >> 40-45 MB/s on Windows 7 and 65 MB/s on Windows 2008.
> >>
> >> To test the theory that the issue was on the Windows side, I connected
> >> the Windows 7 initiator to a commercial SAN and repeated the same
> >> tests.  I got results of around 87 MB/s.  The commercial SAN was
> >> configured similarly to my Linux server - RAID 5, four 2TB WD Red
> >> disks - and has similar hardware (Intel Atom processor, e1000e NICs,
> >> although less physical RAM: 1GB vs 2GB).
> >>
> >> The results are fairly repeatable (+/- a couple of MB/s) and, at least
> >> with Windows 7, do not appear to suggest a specific issue with the
> >> Windows side of the equation.  The CIFS performance would suggest (to
> >> me, at least) that there is not a basic networking problem, either.
> >>
> >> I've tried a number of different things in an attempt to affect the
> >> iSCSI performance:  changing the disk scheduling (CFQ, Deadline, and
> >> noop), confirming write caching is on with hdparm, tweaking vm
> >> parameters in the kernel, tweaking TCP and adapter parameters (both in
> >> Linux and Windows), etc.  Interestingly, the performance numbers do
> >> not seem to change by more than +/- 10% in aggregate, with enough
> >> variability in the results that I'd suggest the changes are
> >> essentially in the noise.  I will note that I have not gone to
> >> 9000-byte MTUs, but that seems irrelevant as the commercial SAN I
> >> compared against wasn't using that, either.
> >>
> >> I attempted to look at wireshark traces to identify any obvious
> >> patterns that might be had from the traffic.  Unfortunately, the
> >> amount of data required before I was able to start seeing repeatable
> >> differences in the aggregate rates (>400MB of file transfers) combined
> >> with offloading and the significant amount of caching in Windows has
> >> made such an analysis a bit tricky.
> >>
> >> It seems to me that there is something misconfigured in a very basic
> >> way which limits the performance far more severely than can be
> >> explained by simple tuning, but I am at a loss to understand what it
> >> is.
> >>
> >> I am hoping that this sounds familiar to someone who can give me
> >> some pointers on where I need to focus my attention.
> >>
> >
> > I recommend pursuing a few different things..
> >
> > First, you'll want to bump the default_cmdsn_depth from 16 to 64.  This
> > is the maximum number of commands allowed in flight (per session) at any
> > given time.  This can be changed with 'set attrib default_cmdsn_depth
> > 64' from within the targetcli TPG context, or it can be changed on a
> > per-NodeACL basis if you're not using TPG demo mode.
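> >
> > For example, this can also be flipped directly in configfs on a running
> > target (the IQN below is just a placeholder for your target's actual
> > WWN, and TPG 1 is assumed):
> >
> > # echo 64 > /sys/kernel/config/target/iscsi/<your_target_iqn>/tpgt_1/attrib/default_cmdsn_depth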
> >
> > The second is to try with write cache (buffered writes) enabled.  By
> > default both IBLOCK and FILEIO are running without write cache enabled,
> > to favor strict data integrity during target power loss over backend
> > performance.  IBLOCK itself can set the WriteCacheEnabled=1 bit via
> > emulate_write_cache, but all WRITEs are still going to be submitted +
> > completed to the underlying storage (which may also have a cache of its
> > own) before acknowledgement.
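> >
> > For example, for an IBLOCK backstore the bit lives under the device's
> > attrib group in configfs (iblock_0 and the backstore name below are
> > just placeholders for whatever your setup uses):
> >
> > # echo 1 > /sys/kernel/config/target/core/iblock_0/<your_dev>/attrib/emulate_write_cache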
> >
> > For FILEIO however, there is a buffered mode, which puts all WRITEs into
> > the buffer cache and acknowledges immediately, and lets VFS writeback
> > occur based upon /proc/sys/vm/dirty_[writeback,expire]_centisecs.  This
> > can be enabled during FILEIO creation in targetcli by setting
> > 'buffered=true', which depending upon your version of the target will
> > automatically set 'emulate_write_cache=1'.
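> >
> > For example, creating a buffered FILEIO backstore in targetcli would
> > look something like this (name, file path and size are placeholders,
> > and the exact parameter names can vary between targetcli versions):
> >
> > /backstores/fileio> create name=test file_or_dev=/tmp/test size=1G buffered=true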
> >
> > You can verify that buffered mode is enabled in configfs with the 'Mode'
> > output, which depending on your kernel version should look something
> > like:
> >
> > # cat /sys/kernel/config/target/core/fileio_0/test/info
> > Status: DEACTIVATED  Execute/Max Queue Depth: 0/0  SectorSize: 512  MaxSectors: 1024
> >         TCM FILEIO ID: 0        File: /tmp/test  Size: 1073741824  Mode: Buffered
> >
> > The third thing is to enable TCP_NODELAY on the Windows side, which does
> > not enable this option by default.  This needs to be enabled on a
> > per-interface basis in the registry, and should be easy enough to find
> > on Google.
> >
> > Please let the list know your results.
> >
> > --nab
> >





