> -----Original Message-----
> From: Alex Gorbachev [mailto:ag@xxxxxxxxxxxxxxxxxxx]
> Sent: 04 July 2016 22:00
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: Oliver Dzombic <info@xxxxxxxxxxxxxxxxx>; ceph-users <ceph-users@xxxxxxxxxxxxxx>; mq <maoqi1982@xxxxxxx>; Christian Balzer <chibi@xxxxxxx>
> Subject: Re: suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
>
> Hi Nick,
>
> On Fri, Jul 1, 2016 at 2:11 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
> <snip>
>
> > However, there are a number of pain points with iSCSI + ESXi + RBD, and they all mainly centre on write latency. It seems VMFS was designed around the fact that enterprise storage arrays service writes in 10-100us, whereas Ceph will service them in 2-10ms.
> >
> > 1. Thin provisioning makes things slow. I believe the main cause is that when growing and zeroing the new blocks, metadata needs to be updated and the block zero'd. Both issue small IO which would normally not be a problem, but with Ceph it becomes a bottleneck to overall IO on the datastore.
> >
> > 2. Snapshots effectively turn all IO into 64kb IO's. Again, a traditional SAN will coalesce these back into a stream of larger IO's before committing to disk. However, with Ceph each IO takes 2-10ms and so everything seems slow. The future feature of persistent RBD cache may go a long way to helping with this.
>
> Are you referring to ESXi snapshots? Specifically, if a VM is running off a snapshot
> (https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1015180),
> its IO will drop to 64KB "grains"?

Yep, that's the one.

> > 3. >2TB VMDK's with snapshots use a different allocation mode, which happens in 4kb chunks instead of 64kb ones. This makes the problem 16 times worse than above.
> >
> > 4. Any of the above will also apply when migrating machines around, so VM's can take hours/days to move.
> >
> > 5. If you use FILEIO, you can't use thin provisioning. If you use BLOCKIO, you get thin provisioning, but no pagecache or readahead, so performance can nose dive if this is needed.
>
> Would not FILEIO also leverage the Linux scheduler to do IO coalescing and help with (2)? Since FILEIO also uses the dirty flush mechanism in page cache (and makes IO somewhat crash-unsafe at the same time).

Turning off nv_cache and enabling write_through should make this safe, but then you won't benefit from any writeback flushing.
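Roughly, that corresponds to an scst.conf along these lines (just a sketch from memory, untested; the device name, backing RBD path and IQN are placeholders):

  HANDLER vdisk_fileio {
          DEVICE vmware_disk1 {
                  # kernel-mapped RBD backing the LUN (placeholder path)
                  filename /dev/rbd/rbd/vmware01
                  # don't advertise the page cache as non-volatile
                  nv_cache 0
                  # flush writes through to the backing device
                  write_through 1
          }
  }

  TARGET_DRIVER iscsi {
          enabled 1
          TARGET iqn.2016-07.example:vmware01 {
                  enabled 1
                  rel_tgt_id 1
                  LUN 0 vmware_disk1
          }
  }

You keep the read side of the page cache that way, but every write has to wait on the RBD, so it won't hide any of the latency discussed above.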
> > 6. iSCSI is very complicated (especially ALUA) and sensitive. Get used to seeing APD/PDL even when you think you have finally got everything working great.
>
> We were used to seeing APD/PDL all the time with LIO, but pretty much have not seen any with SCST > 3.1. Most of the ESXi problems are just with high latency periods, which are not a problem for the hypervisor itself, but rather for the databases or applications inside the VMs.

Yeah, I think once you get SCST working, it's pretty stable. Certainly the best of the bunch. But I was more referring to "actually getting it working" :-) Particularly once you start introducing Pacemaker, there are so many corner cases you need to take into account that I'm still not 100% satisfied with the stability. E.g. I spent a long time working on the resource agents to make sure all the LUNs and targets could shut down cleanly on a node. Depending on load and the number of iSCSI connections, it would randomly hang and then go into an APD state.

Not saying it can't work, but compared to NFS it seems a lot more complicated to get stable.

> Thanks,
> Alex
>
> > Normal IO from eager zeroed VM's with no snapshots, however, should perform OK. So it depends what your workload is.
> >
> > And then comes NFS. It's very easy to set up, very easy to configure for HA, and works pretty well overall. You don't seem to get any of the IO size penalties when using snapshots. If you mount with discard, thin provisioning is done by Ceph. You can defragment the FS on the proxy node and do several other things that you can't do with VMFS. Just make sure you run the server in sync mode to avoid data loss.
> >
> > The only downside is that every IO causes an IO to the FS and one to the FS journal, so you effectively double your IO. But if your Ceph backend can support it, then it shouldn't be too much of a problem.
> >
> > Now to the original poster: assuming the iSCSI node is just kernel mounting the RBD, I would run iostat on it, to try and see what sort of latency you are seeing at that point. Also do the same with esxtop +u and look at the write latency there, both whilst running the fio in the VM. This should hopefully let you see if there is just a gradual increase as you go from hop to hop, or if there is an obvious culprit.
> >
> > Can you also confirm your kernel version?
> >
> > With 1GB networking I think you will struggle to get your write latency much below 10-15ms, but from your example ~30ms is still a bit high. I wonder if the default queue depths on your iSCSI target are too low as well?
> >
> > Nick
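For what it's worth, the NFS proxy setup mentioned above boils down to something like this (a rough sketch only; the pool/image name, mount point and export subnet are placeholders, and "sync" in the export is the important bit):

  # On the proxy node: map the RBD, create a filesystem and mount it with discard
  rbd map rbd/nfs_vmware
  mkfs.xfs /dev/rbd/rbd/nfs_vmware
  mount -o discard /dev/rbd/rbd/nfs_vmware /export/vmware

  # /etc/exports - sync mode is what avoids data loss if the proxy dies
  /export/vmware 192.168.50.0/24(rw,sync,no_root_squash)

  # Re-export, then add the share on ESXi as an NFS datastore
  exportfs -ra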
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Oliver Dzombic
> >> Sent: 01 July 2016 09:27
> >> To: ceph-users@xxxxxxxxxxxxxx
> >> Subject: Re: suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
> >>
> >> Hi,
> >>
> >> my experience:
> >>
> >> ceph + iscsi ( multipath ) + vmware == worst
> >>
> >> Better to search for another solution.
> >>
> >> ceph + nfs + vmware might have much better performance.
> >>
> >> --------
> >>
> >> If you are able to get vmware running with iscsi and ceph, I would be >>very<< interested in what/how you did that.
> >>
> >> --
> >> Mit freundlichen Gruessen / Best regards
> >>
> >> Oliver Dzombic
> >> IP-Interactive
> >>
> >> mailto:info@xxxxxxxxxxxxxxxxx
> >>
> >> Address:
> >>
> >> IP Interactive UG ( haftungsbeschraenkt )
> >> Zum Sonnenberg 1-3
> >> 63571 Gelnhausen
> >>
> >> HRB 93402, registered at Amtsgericht Hanau
> >> Managing director: Oliver Dzombic
> >>
> >> Tax no.: 35 236 3622 1
> >> VAT ID: DE274086107
> >>
> >>
> >> On 01.07.2016 at 07:04, mq wrote:
> >> > Hi list,
> >> > I have tested SUSE Enterprise Storage 3 using 2 iSCSI gateways attached to VMware. The performance is bad. I have turned off VAAI following
> >> > https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665.
> >> >
> >> > My cluster:
> >> > 3 ceph nodes: 2*E5-2620, 64G mem, 2*1Gbps, (3*10K SAS, 1*480G SSD) per node, SSD as journal
> >> > 1 vmware node: 2*E5-2620, 64G mem, 2*1Gbps
> >> >
> >> > # ceph -s
> >> >     cluster 0199f68d-a745-4da3-9670-15f2981e7a15
> >> >      health HEALTH_OK
> >> >      monmap e1: 3 mons at {node1=192.168.50.91:6789/0,node2=192.168.50.92:6789/0,node3=192.168.50.93:6789/0}
> >> >             election epoch 22, quorum 0,1,2 node1,node2,node3
> >> >      osdmap e200: 9 osds: 9 up, 9 in
> >> >             flags sortbitwise
> >> >       pgmap v1162: 448 pgs, 1 pools, 14337 MB data, 4935 objects
> >> >             18339 MB used, 5005 GB / 5023 GB avail
> >> >                  448 active+clean
> >> >   client io 87438 kB/s wr, 0 op/s rd, 213 op/s wr
> >> >
> >> > sudo ceph osd tree
> >> > ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >> > -1 4.90581 root default
> >> > -2 1.63527     host node1
> >> >  0 0.54509         osd.0       up  1.00000          1.00000
> >> >  1 0.54509         osd.1       up  1.00000          1.00000
> >> >  2 0.54509         osd.2       up  1.00000          1.00000
> >> > -3 1.63527     host node2
> >> >  3 0.54509         osd.3       up  1.00000          1.00000
> >> >  4 0.54509         osd.4       up  1.00000          1.00000
> >> >  5 0.54509         osd.5       up  1.00000          1.00000
> >> > -4 1.63527     host node3
> >> >  6 0.54509         osd.6       up  1.00000          1.00000
> >> >  7 0.54509         osd.7       up  1.00000          1.00000
> >> >  8 0.54509         osd.8       up  1.00000          1.00000
> >> >
> >> > A Linux VM in VMware, running fio: the 4k randwrite result is just 64 IOPS and latency is high; a dd test gives just 11MB/s.
> >> >
> >> > fio -ioengine=libaio -bs=4k -direct=1 -thread -rw=randwrite -size=100G -filename=/dev/sdb -name="EBS 4KB randwrite test" -iodepth=32 -runtime=60
> >> > EBS 4KB randwrite test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> >> > fio-2.0.13
> >> > Starting 1 thread
> >> > Jobs: 1 (f=1): [w] [100.0% done] [0K/131K/0K /s] [0 /32 /0 iops] [eta 00m:00s]
> >> > EBS 4KB randwrite test: (groupid=0, jobs=1): err= 0: pid=6766: Wed Jun 29 21:28:06 2016
> >> >   write: io=15696KB, bw=264627 B/s, iops=64, runt= 60737msec
> >> >     slat (usec): min=10, max=213, avg=35.54, stdev=16.41
> >> >     clat (msec): min=1, max=31368, avg=495.01, stdev=1862.52
> >> >      lat (msec): min=2, max=31368, avg=495.04, stdev=1862.52
> >> >     clat percentiles (msec):
> >> >      |  1.00th=[    7],  5.00th=[    8], 10.00th=[    8], 20.00th=[    9],
> >> >      | 30.00th=[    9], 40.00th=[   10], 50.00th=[  198], 60.00th=[  204],
> >> >      | 70.00th=[  208], 80.00th=[  217], 90.00th=[  799], 95.00th=[ 1795],
> >> >      | 99.00th=[ 7177], 99.50th=[12649], 99.90th=[16712], 99.95th=[16712],
> >> >      | 99.99th=[16712]
> >> >     bw (KB/s)  : min=   36, max=11960, per=100.00%, avg=264.77, stdev=1110.81
> >> >     lat (msec) : 2=0.03%, 4=0.23%, 10=40.93%, 20=0.48%, 50=0.03%
> >> >     lat (msec) : 100=0.08%, 250=39.55%, 500=5.63%, 750=2.91%, 1000=1.35%
> >> >     lat (msec) : 2000=4.03%, >=2000=4.77%
> >> >   cpu          : usr=0.02%, sys=0.22%, ctx=2973, majf=0, minf=18446744073709538907
> >> >   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.4%, 32=99.2%, >=64=0.0%
> >> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> >> >      issued    : total=r=0/w=3924/d=0, short=r=0/w=0/d=0
> >> >
> >> > Run status group 0 (all jobs):
> >> >   WRITE: io=15696KB, aggrb=258KB/s, minb=258KB/s, maxb=258KB/s, mint=60737msec, maxt=60737msec
> >> >
> >> > Disk stats (read/write):
> >> >   sdb: ios=83/3921, merge=0/0, ticks=60/1903085, in_queue=1931694, util=100.00%
> >> >
> >> > Can anyone give me some suggestions to improve the performance?
> >> >
> >> > Regards
> >> >
> >> > MQ
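Following the suggestion above, a quick way to tell whether that latency is coming from Ceph itself or from the iSCSI/VMFS path would be something like this on one of the gateway nodes (the scratch image name is a placeholder; don't run fio against the RBD that backs the live datastore):

  # Create and map a scratch test image
  rbd create rbd/fiotest --size 10240
  rbd map rbd/fiotest

  # Watch per-device latency on the gateway while the in-VM fio test runs
  iostat -xmt 2

  # Then run the same 4k randwrite job directly against the scratch RBD,
  # bypassing iSCSI and ESXi, and compare IOPS/latency with the in-VM numbers
  fio -ioengine=libaio -bs=4k -direct=1 -rw=randwrite -size=10G \
      -filename=/dev/rbd/rbd/fiotest -name="rbd direct 4k randwrite" -iodepth=32 -runtime=60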
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com