On Tue, 2011-12-27 at 13:36 +0200, Pasi Kärkkäinen wrote:
> On Thu, Dec 22, 2011 at 07:54:46PM -0500, John A. Sullivan III wrote:
> > On Wed, 2011-10-05 at 15:54 -0400, Adam Chasen wrote:
> > > John,
> > > I am limited in a similar fashion. I would much prefer to use multibus
> > > multipath, but was unable to achieve bandwidth which would exceed a
> > > single link even though it was spread over the 4 available links. Were
> > > you able to gain even similar performance to the RAID0 setup with the
> > > multibus multipath?
> > >
> > > Thanks,
> > > Adam
> > <snip>
> > We just ran a quick benchmark before optimizing. Using multibus rather
> > than RAID0 with four GbE NICs, and testing with a simple
> > cat /dev/zero > zeros, we hit 3.664 Gbps!
> >
> > This is still on CentOS 5.4, so we are not able to play with
> > rr_min_io_rq. We have not yet activated jumbo frames. We are also
> > thinking of using SFQ as a qdisc instead of the default pfifo_fast. So
> > we think we can make it go even faster.
> >
> > We are delighted to be achieving this with multibus rather than RAID0,
> > as it means we can take transactionally consistent snapshots on the SAN.
> >
> > Many thanks to whoever pointed out that tag queueing should solve the
> > 4 KB block size latency problem. The problem turned out not to be
> > latency, as we were told, but simply an under-resourced SAN. We brought
> > in new Nexenta SANs with much more RAM and they are flying - John
>
> Hey,
>
> Can you please post your multipath configuration?
> Just for reference for future people googling for this :)
>
> -- Pasi
<snip>
Sure, although I would be a bit careful. I think there are a few things
we need to tweak in it, and the lead engineer on the product and I just
haven't had the time to go over it. It is also based upon CentOS 5.4, so
we do not have rr_min_io_rq.
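For what it is worth, the benchmark itself was nothing fancy: we simply
cat /dev/zero into a file on the multipath-backed filesystem and watched
the NIC counters. A roughly equivalent test that reports a number
directly (just a sketch - we have not standardized on it, and /backups is
simply the mount point from the mount line further down) would be
something like:

# write 16 GB of zeros through the multipath device, bypassing the page cache
dd if=/dev/zero of=/backups/zeros bs=1M count=16384 oflag=direct
# dd prints MB/s; MB/s x 8 / 1000 gives an approximate Gbps figure to
# compare against the 3.664 Gbps we saw across the four GbE links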
We are a moderately secure environment, so I might need to scrub a bit of
data:

multipath.conf:

blacklist {
#       devnode "*"
        # sdb
        wwid SATA_ST3250310NS_9XX0LYYY
        # sda
        wwid SATA_ST3250310NS_9XX0LZZZ
        # The above does not seem to be working, thus we will do
        devnode "^sd[ab]$"
        # This is usually a bad idea as the device names can change
        # However, since we add our iSCSI devices long after boot, I think
        # we are safe
}

defaults {
        udev_dir                /dev
        polling_interval        5
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            "/bin/bash /sbin/mpath_prio_ssi %n"
        # This needs to be cleaned up
        prio_callout            /bin/true
        path_checker            directio
        rr_min_io               100
        max_fds                 8192
        rr_weight               uniform
        failback                immediate
        no_path_retry           fail
#       user_friendly_names     yes
}

multipaths {
        multipath {
                wwid    aaaaaaaaaa53f0d0000004e81f27d0001
                alias   isda
        }
        multipath {
                wwid    aaaaaaaaaa53f0d0000004e81f2910002
                alias   isdb
        }
        multipath {
                wwid    aaaaaaaaaa53f0d0000004e81f2ab0003
                alias   isdc
        }
        multipath {
                wwid    aaaaaaaaaa53f0d0000004e81f2c10004
                alias   isdd
        }
}

devices {
        device {
                vendor                  "NEXENTA"
                product                 "COMSTAR"
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                features                "0"
                hardware_handler        "0"
        }
}

Other miscellaneous settings:

# Some optimizations for the SAN network
ip link set eth0 txqlen 2000
ip link set eth1 txqlen 2000
ip link set eth2 txqlen 2000
ip link set eth3 txqlen 2000

The more we read about and test bufferbloat
(http://www.bufferbloat.net/projects/bloat), the more we are thinking of
dramatically reducing these buffers instead, as it is quite possible for
one new iSCSI conversation to become backlogged behind another, and I
suspect that could also wreak havoc on command reordering when we are
doing round robin across the interfaces. We are also thinking of changing
the queueing discipline from the default pfifo_fast. Since it is all the
same traffic, there is no need to band it the way pfifo_fast does by
examining the TOS bits, so a plain fifo qdisc might be a hair faster. On
the other hand, we might want to go with SFQ so that one heavy iSCSI
conversation cannot starve the others or keep them from ramping up
quickly through TCP slow start.
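If we do experiment with smaller buffers and SFQ, the change itself is
small. A minimal sketch of what we might try (nothing we have deployed
yet; the queue length of 256 and the perturb interval are just starting
guesses):

# shrink the transmit queues instead of enlarging them, and swap the
# default pfifo_fast root qdisc for SFQ on each SAN-facing NIC
for dev in eth0 eth1 eth2 eth3; do
    ip link set $dev txqueuelen 256
    tc qdisc replace dev $dev root sfq perturb 10
done
# tc qdisc del dev ethX root restores the default qdisc if it backfires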
multipath -F   # flush multipath
sleep 2
service multipathd start
sleep 2
blockdev --setra 1024 /dev/mapper/isda
blockdev --setra 1024 /dev/mapper/isdb
blockdev --setra 1024 /dev/mapper/isdc
blockdev --setra 1024 /dev/mapper/isdd
mount -o defaults,noatime /dev/mapper/id02sdd /backups   # Note the noatime

From sysctl.conf:

# Controls tcp maximum receive window size
#net.core.rmem_max = 409600
#net.core.rmem_max = 8738000
net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 8192 873800 16777216

# Controls tcp maximum send window size
#net.core.wmem_max = 409600
#net.core.wmem_max = 6553600
net.core.wmem_max = 16777216
net.ipv4.tcp_wmem = 4096 655360 16777216

# Controls disabling Nagle algorithm and delayed acks
net.ipv4.tcp_low_latency = 1
net.core.netdev_max_backlog = 2000

# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1

# Controls the default maximum size of a message queue, in bytes
kernel.msgmnb = 65536

# Controls the maximum size of a message, in bytes
kernel.msgmax = 65536

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum amount of shared memory, in pages
kernel.shmall = 4294967296

# Controls when we call for more entropy
# Since these systems have no mouse or keyboard and Linux no longer uses
# network I/O for entropy, we are regularly running low on entropy
kernel.random.write_wakeup_threshold = 1024
# Not really needed for iSCSI - just an interesting setting we use in
# conjunction with haveged to address the lack of entropy on headless
# systems

We have not yet re-enabled jumbo packets, as that actually reduced
throughput in the past, but that may have been related to the lack of
resources in the original unit.

Hope this helps. We are not experts, so if someone sees something we can
tweak, please point it out - John

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel