On Tue, 2009-03-24 at 20:17 +0200, Pasi Kärkkäinen wrote:
> On Tue, Mar 24, 2009 at 01:30:10PM -0400, John A. Sullivan III wrote:
> > Thanks very much, again, and, again, I'll reply in the text - John
> >
> Np :)
>
> > > iirc the 2810 does not have very big buffers per port, so you might be
> > > better off using flow control instead of jumbos.. then again I'm not sure
> > > how good a flow control implementation HP has?
> > >
> > > The whole point of flow control is to prevent packet loss/drops.. this
> > > happens by sending pause frames before the port buffers get full. If the
> > > port buffers do fill up, the switch has no option but to drop packets..
> > > and that causes TCP retransmits -> causes delay, and TCP slows down to
> > > prevent further packet drops.
> > >
> > > Flow control "pause frames" cause less delay than TCP retransmits.
> > >
> > > Do you see TCP retransmits with "netstat -s"? Check both the target and
> > > the initiators.
> > Thankfully this is an area of some expertise for me (unlike disk I/O -
> > obviously ;) ). We have been pretty thorough about checking the network
> > path. We've not seen any upper layer retransmission or buffer overflows.
> Good :)
>
> > > > > > > What kind of performance do you get using just a single iSCSI
> > > > > > > session (and thus just a single path), no multipathing, no DM
> > > > > > > RAID0? Just a filesystem directly on top of the iSCSI /dev/sd?
> > > > > > > device.
> > > > > > Miserable - the same roughly 12 MB/s.
> > > > > OK, here's your problem. Was this btw reads or writes? Did you tune
> > > > > readahead settings?
> > > > 12 MB/s is sequential reading, but sequential writing is not much
> > > > different. We did tweak readahead to 1024. We did not want to go much
> > > > larger in order to maintain balance with the various data patterns -
> > > > some of which are random and some of which may not read linearly.
> > > I did some benchmarking earlier between two servers; one running an ietd
> > > target with 'nullio' and the other running the open-iscsi initiator, both
> > > using a single gigabit NIC.
> > >
> > > I remember getting very close to full gigabit speed, at least with bigger
> > > block sizes. I can't remember how much I got with 4 kB blocks.
> > >
> > > Those tests were made with dd.
> > Yes, if we use 64KB blocks, we can saturate a Gig link. With larger sizes,
> > we can push over 3 Gbps over the four gig links in the test environment.
> That's good.
>
> > > A nullio target is a good way to benchmark your network and initiator and
> > > verify everything is correct.
> > >
> > > Also it's good to first test, for example with FTP and iperf, to verify
> > > the network is working properly between the target and the initiator and
> > > all the other basic settings are correct.
> > We did flood ping the network and had all interfaces operating at near
> > capacity. The network itself looks very healthy.
> Ok.
>
> > > Btw, have you configured the TCP stacks of the servers? Bigger default
> > > TCP window size, bigger maximum TCP window size etc..
> > Yep, tweaked transmit queue length, receive and transmit windows, net
> > device backlogs, buffer space, disabled nagle, and even played with the
> > dirty page watermarks.
> That's all taken care of then :)
> Also on the target?
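For anyone checking the same things, the retransmit check and the TCP/readahead
tuning described above amount to something along these lines on Linux. The
interface and device names (eth1, /dev/sdb) are placeholders, and the values
are illustrative rather than the exact ones we used:

    # Look for retransmits on both the initiator and the target
    netstat -s | grep -i retrans

    # Larger TCP windows / socket buffers and a deeper receive backlog
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
    sysctl -w net.core.netdev_max_backlog=2500

    # Longer transmit queue on the iSCSI-facing interface
    ifconfig eth1 txqueuelen 10000

    # One way to set readahead (in 512-byte sectors) on the iSCSI block device
    blockdev --setra 1024 /dev/sdb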
> > > > > Can you paste your iSCSI session settings negotiated with the target?
> > > > Pardon my ignorance :( but, other than packet traces, how do I show the
> > > > final negotiated settings?
> > > Try:
> > > iscsiadm -i -m session
> > > iscsiadm -m session -P3
> > Here's what it says. Pretty much as expected. We are using COMSTAR on the
> > target and took some traces to see what COMSTAR was expecting. We set the
> > open-iscsi parameters to match:
> >
> > Current Portal: 172.x.x.174:3260,2
> > Persistent Portal: 172.x.x.174:3260,2
> > **********
> > Interface:
> > **********
> > Iface Name: default
> > Iface Transport: tcp
> > Iface Initiatorname: iqn.2008-05.biz.ssi:vd-gen
> > Iface IPaddress: 172.x.x.162
> > Iface HWaddress: default
> > Iface Netdev: default
> > SID: 32
> > iSCSI Connection State: LOGGED IN
> > iSCSI Session State: LOGGED_IN
> > Internal iscsid Session State: NO CHANGE
> > ************************
> > Negotiated iSCSI params:
> > ************************
> > HeaderDigest: None
> > DataDigest: None
> > MaxRecvDataSegmentLength: 131072
> > MaxXmitDataSegmentLength: 8192
> > FirstBurstLength: 65536
> > MaxBurstLength: 524288
> > ImmediateData: Yes
> > InitialR2T: Yes
> I guess InitialR2T could be No for a bit better performance?
> MaxXmitDataSegmentLength looks small?
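For reference, those parameters are driven from the initiator's iscsid.conf
defaults (and can also be pushed into an already-discovered node record with
iscsiadm). A rough sketch of the relevant knobs - the values are only
illustrative, the target IQN below is a placeholder, and, if I understand the
negotiation right, MaxXmitDataSegmentLength is capped by whatever
MaxRecvDataSegmentLength COMSTAR advertises, so that one has to be raised on
the target side:

    # /etc/iscsi/iscsid.conf - defaults applied to newly discovered nodes
    node.session.iscsi.InitialR2T = No
    node.session.iscsi.ImmediateData = Yes
    node.session.iscsi.FirstBurstLength = 262144
    node.session.iscsi.MaxBurstLength = 524288
    node.conn[0].iscsi.MaxRecvDataSegmentLength = 131072

    # Or update an existing node record, then log out/in to renegotiate
    iscsiadm -m node -T <target-iqn> -p 172.x.x.174:3260 \
        -o update -n node.session.iscsi.InitialR2T -v No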
> > > > > You should be able to get many times the throughput you get now..
> > > > > just with a single path/session.
> > > > >
> > > > > What kind of latency do you have from the initiator to the
> > > > > target/storage?
> > > > >
> > > > > Try with for example a 4 kB ping:
> > > > > ping -s 4096 <ip_of_the_iscsi_target>
> > > > We have about 400 microseconds - that seems a bit high :(
> > > > rtt min/avg/max/mdev = 0.275/0.337/0.398/0.047 ms
> > > Yeah.. that's a bit high.
> > Actually, with more testing, we're seeing it stretch up to over 700
> > microseconds. I'll attach a raft of data I collected at the end of this
> > email.
> Ok.
>
> > > I think Ross suggested in some other thread the following settings for
> > > e1000 NICs:
> > >
> > > "Set the e1000s InterruptThrottleRate=1 and their TxRingBufferSize=4096
> > > and RxRingBufferSize=4096 (verify those option names with a modinfo)
> > > and add those to modprobe.conf."
> > We did try playing with the ring buffer but to no avail. Modinfo does not
> > seem to display the current settings. We did try playing with setting the
> > InterruptThrottleRate to 1 but again to no avail. As I'll mention later, I
> > suspect the issue might be the OpenSolaris-based target.
> Could be..
>
> > > > I would love to use larger block sizes as you suggest in your other
> > > > email but, on AMD64, I believe we are stuck with 4KB. I've not seen any
> > > > way to change it and would gladly do so if someone knows how.
> > > Are we talking about filesystem block sizes? That shouldn't be a problem
> > > if your application uses larger block sizes for read/write operations..
> > Yes, file system block size. When we try rough, end-user style tests,
> > e.g., large file copies, we seem to get the performance indicated by 4KB
> > blocks, i.e., lousy!
> Yep.. try upgrading to 10 Gbit Ethernet for much lower latency ;)
>
> > > Try for example with:
> > > dd if=/dev/zero of=/iscsilun/file.bin bs=1024k count=1024
> > Large block sizes can make the system truly fly, so we suspect you are
> > absolutely correct about latency being the issue. We did do our testing
> > with raw interfaces, by the way.
> Ok.
>
> <snip>
>
> > I did a little digging and calculating, and here is what I came up with
> > and sent to Nexenta. Please tell me if I am on the right track.
> >
> > I am using jumbo frames and should be able to get two 4KB blocks per
> > frame. Total size should be 8192 + 78 (TCP + IP + Ethernet + CRC - oops,
> > we need to add iSCSI - what size is the iSCSI header?) + 12 (interframe
> > gap) = 8282 bytes. Transmission latency should be 8282 * 8 /
> > 1,000,000,000 = 66.3 microseconds. Switch latency is 5.7 microseconds, so
> > let's say network latency is 72 - well, let's say 75 microseconds. The
> > only additional latency should be added by the network stacks on the
> > target and initiator.
> >
> > Current round-trip latency between the initiator (Linux) and target
> > (Nexenta) is around 400 microseconds and fluctuates significantly:
> >
> > Hmm.. this is worse than the last test:
> > PING 172.30.13.158 (172.30.13.158) 8192(8220) bytes of data.
> > --- 172.30.13.158 ping statistics ---
> > 33 packets transmitted, 33 received, 0% packet loss, time 32000ms
> > rtt min/avg/max/mdev = 0.399/0.574/1.366/0.161 ms
> >
> > There is nothing going on in the network. So we are seeing 574
> > microseconds total with only 150 microseconds attributable to
> > transmission. And we see a wide variation in latency.
> Yeah, something's wrong there.. How much latency do you have between
> different initiator machines?
>
> > I then tested the latency between interfaces on the initiator and the
> > target. Here is what I get for internal latency on the Linux initiator:
> > PING 172.30.13.18 (172.30.13.18) from 172.30.13.146 : 8192(8220) bytes of data.
> > --- 172.30.13.18 ping statistics ---
> > 29 packets transmitted, 29 received, 0% packet loss, time 27999ms
> > rtt min/avg/max/mdev = 0.017/0.018/0.033/0.005 ms
> >
> > A very consistent 18 microseconds.
> Yeah, I take it that's not through the network/switch :)
>
> > Here is what I get from the Z200:
> > root@disk01:/etc# ping -s -i e1000g6 172.30.13.190 4096
> > PING 172.30.13.190: 4096 data bytes
> > ----172.30.13.190 PING Statistics----
> > 31 packets transmitted, 31 packets received, 0% packet loss
> > round-trip (ms) min/avg/max/stddev = 0.042/0.066/0.104/0.019
> Big difference.. I'm not familiar with Solaris, so I can't really suggest
> what to tune there..
>
> > Notice it is several times the latency, with much wider variation. How do
> > we tune the OpenSolaris network stack to reduce its latency? I'd really
> > like to improve the individual user experience. I can tell them it's like
> > commuting to work on the train instead of the car during rush hour -
> > faster when there's lots of traffic but slower when there is not - but
> > they will judge the product by their individual experiences more than
> > their collective experiences. Thus, I really want to improve the
> > individual disk operation throughput.
> >
> > Latency seems to be our key. If the initiator and target stacks each
> > added only 20 microseconds of latency, that would be roughly 200
> > microseconds round trip. That would almost triple the throughput from
> > what we are currently seeing.
> Indeed :)
>
> > Unfortunately, I'm a bit ignorant of tweaking networks on OpenSolaris. I
> > can certainly learn, but am I headed in the right direction, or is this
> > direction of investigation misguided? Thanks - John
> Low latency is the key to good (iSCSI) SAN performance, as it directly
> gives you more (possible) IOPS.
>
> The other option is to configure software/settings so that there are
> multiple outstanding I/Os in flight.. then you're not limited by the
> latency (so much).
>
> -- Pasi
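On Pasi's last point, the number of commands open-iscsi keeps in flight per
session is capped in iscsid.conf. A minimal sketch, assuming the stock
parameter names; the values are examples only, and /dev/sdb stands in for the
iSCSI device:

    # /etc/iscsi/iscsid.conf - allow more outstanding commands per session
    node.session.cmds_max = 1024
    node.session.queue_depth = 128

    # The workload still has to issue concurrent I/O for this to help, e.g.
    # several dd streams reading different regions of the raw device:
    for i in 0 1 2 3; do
        dd if=/dev/sdb of=/dev/null bs=4k count=100000 skip=$((i * 100000)) &
    done
    wait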
<snip>

Ross has been of enormous help offline. Indeed, disabling jumbo packets
produced an almost 50% increase in single-threaded throughput. We are pretty
well set, although still a bit disappointed in the latency we are seeing in
OpenSolaris, and we have escalated to the vendor about addressing it.

The one piece which is still a mystery is why using four targets on four
separate interfaces, striped with dm RAID0, does not produce an aggregate of
slightly less than four times the IOPS of a single target on a single
interface. This would not seem to be the out-of-order SCSI command problem of
multipath. One of life's great mysteries yet to be revealed. Thanks again,
all - John
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@xxxxxxxxxxxxxxxxxxx

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel