Re: suse_enterprise_storage3_rbd_LIO_vmware_performance_bad

Nick Fisk <nick@xxxxxxxxxx> · Tue, 5 Jul 2016 09:54:12 +0100

> -----Original Message-----
> From: Alex Gorbachev [mailto:ag@xxxxxxxxxxxxxxxxxxx]
> Sent: 04 July 2016 22:00
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: Oliver Dzombic <info@xxxxxxxxxxxxxxxxx>; ceph-users
> <ceph-users@xxxxxxxxxxxxxx>; mq <maoqi1982@xxxxxxx>; Christian Balzer
> <chibi@xxxxxxx>
> Subject: Re: suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
> 
> Hi Nick,
> 
> 
> On Fri, Jul 1, 2016 at 2:11 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> 
> <snip>
> 
> > However, there are a number of pain points with iSCSI + ESXi + RBD, and
> > they mainly centre on write latency. It seems VMFS was designed around
> > the assumption that enterprise storage arrays service writes in 10-100us,
> > whereas Ceph will service them in 2-10ms.
> >
> > 1. Thin provisioning makes things slow. I believe the main cause is that
> > when growing and zeroing the new blocks, the metadata needs to be updated
> > and the block zeroed. Both issue small IOs, which would normally not be a
> > problem, but with Ceph they become a bottleneck to overall IO on the
> > datastore.
> >
> > 2. Snapshots effectively turn all IO into 64KB IOs. Again, a traditional
> > SAN will coalesce these back into a stream of larger IOs before committing
> > to disk. However, with Ceph each IO takes 2-10ms, so everything seems
> > slow. The future persistent RBD cache feature may go a long way towards
> > helping with this.
> 
> Are you referring to ESXi snapshots? Specifically, if a VM is running off a
> snapshot
> (https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1015180),
> will its IO drop to 64KB "grains"?

Yep, that’s the one
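
To put rough numbers on point 2: if each 64KB grain is written one at a time and has to wait for a full Ceph round trip, throughput is bounded by grain size divided by latency. Below is a minimal Python sketch of that arithmetic. It assumes strictly serial IO (queue depth 1) and uses only the 2-10ms latency range and the 64KB / 4KB sizes mentioned in this thread; the function name is made up for illustration.

```python
def serial_throughput_mb_s(io_size_kb: float, latency_ms: float) -> float:
    """Upper bound on throughput when IOs are issued one at a time.

    Assumes queue depth 1: each IO must complete before the next starts.
    """
    ios_per_second = 1000.0 / latency_ms
    return ios_per_second * io_size_kb / 1024.0


if __name__ == "__main__":
    for size_kb in (64, 4):          # 64KB snapshot grains vs 4KB chunks (>2TB VMDKs)
        for latency_ms in (2, 10):   # Ceph write latency range quoted in this thread
            mb_s = serial_throughput_mb_s(size_kb, latency_ms)
            print(f"{size_kb:>2}KB IO at {latency_ms:>2}ms -> {mb_s:.2f} MB/s")
```

Under those assumptions a single serial stream on a snapshotted VM lands somewhere around 6-31 MB/s, and the 4KB case is 16 times lower. Deeper queues will do better, so treat this as an illustration of why it feels slow rather than a prediction.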

> 
> > 3. VMDKs larger than 2TB with snapshots use a different allocation mode,
> > which allocates in 4KB chunks instead of 64KB ones. This makes the problem
> > 16 times worse than above.
> >
> > 4. Any of the above also applies when migrating machines around, so
> > VMs can take hours or days to move.
> >
> > 5. If you use FILEIO, you can't use thin provisioning. If you use BLOCKIO,
> > you get thin provisioning but no pagecache or readahead, so performance
> > can nosedive if either is needed.
> 
> Would FILEIO not also leverage the Linux scheduler to do IO coalescing and
> help with (2)? FILEIO also uses the dirty flush mechanism in the page cache
> (which makes IO somewhat crash-unsafe at the same time).

Turning off nv_cache and enabling write_through should make this safe, but then you won't benefit from any writeback flushing.

> 
> > 6. iSCSI is very complicated (especially ALUA) and sensitive. Get used to
> > seeing APD/PDL even when you think you have finally got everything
> > working great.
> 
> We were used to seeing APD/PDL all the time with LIO, but have seen pretty
> much none with SCST > 3.1. Most of the ESXi problems are just with
> high-latency periods, which are not a problem for the hypervisor itself, but
> rather for the databases or applications inside VMs.

Yeah, I think once you get SCST working, it's pretty stable. Certainly the best of the bunch. But I was referring more to "actually getting it working" :-)

Particularly once you start introducing Pacemaker, there are so many corner cases to take into account that I'm still not 100% satisfied with the stability. E.g. I spent a long time working on the resource agents to make sure all the LUNs and targets could shut down cleanly on a node. Depending on load and the number of iSCSI connections, it would randomly hang and then go into an APD state. I'm not saying it can't work, but compared to NFS it seems a lot more complicated to get stable.

> 
> Thanks,
> Alex
> 
> >
> >
> > Normal IO from eager-zeroed VMs with no snapshots, however, should
> > perform OK. So it depends what your workload is.
> >
> >
> > And then comes NFS. It's very easy to set up, very easy to configure for
> > HA, and works pretty well overall. You don't seem to get any of the IO size
> > penalties when using snapshots. If you mount with discard, thin
> > provisioning is done by Ceph. You can defragment the FS on the proxy node
> > and do several other things that you can't do with VMFS. Just make sure you
> > run the server in sync mode to avoid data loss.
> >
> > The only downside is that every IO causes one IO to the FS and one to the
> > FS journal, so you effectively double your IO. But if your Ceph backend can
> > support it, then it shouldn't be too much of a problem.
> >
> > Now to the original poster: assuming the iSCSI node is just kernel mounting
> > the RBD, I would run iostat on it to see what sort of latency you are
> > seeing at that point. Also do the same with esxtop +u and look at the
> > write latency there, both whilst running the fio in the VM. This should
> > hopefully let you see whether there is just a gradual increase as you go
> > from hop to hop or whether there is an obvious culprit.
> >
> > Can you also confirm your kernel version?
> >
> > With 1Gb networking I think you will struggle to get your write latency
> > much below 10-15ms, but from your example ~30ms is still a bit high. I
> > wonder if the default queue depths on your iSCSI target are too low as well?
> >
> > Nick
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >> Of Oliver Dzombic
> >> Sent: 01 July 2016 09:27
> >> To: ceph-users@xxxxxxxxxxxxxx
> >> Subject: Re: suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
> >>
> >> Hi,
> >>
> >> my experience:
> >>
> >> ceph + iscsi ( multipath ) + vmware == worst
> >>
> >> You would be better off looking for another solution.
> >>
> >> vmware + nfs + vmware might have much better performance.
> >>
> >> --------
> >>
> >> If you are able to get VMware running with iSCSI and Ceph, I would be
> >> >>very<< interested in how you did it.
> >>
> >> --
> >> Best regards
> >>
> >> Oliver Dzombic
> >> IP-Interactive
> >>
> >> mailto:info@xxxxxxxxxxxxxxxxx
> >>
> >> Address:
> >>
> >> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
> >> 63571 Gelnhausen
> >>
> >> HRB 93402, Hanau District Court (Amtsgericht Hanau)
> >> Managing director: Oliver Dzombic
> >>
> >> Tax no.: 35 236 3622 1
> >> VAT ID: DE274086107
> >>
> >>
> >> On 01.07.2016 at 07:04, mq wrote:
> >> > Hi list,
> >> > I have tested SUSE Enterprise Storage 3 using 2 iSCSI gateways
> >> > attached to VMware. The performance is bad. I have turned off VAAI
> >> > following
> >> > https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665
> >> >
> >> > My cluster:
> >> > 3 ceph nodes: 2*E5-2620, 64G mem, 2*1Gbps (3*10K SAS, 1*480G SSD)
> >> > per node, SSD as journal
> >> > 1 vmware node: 2*E5-2620, 64G mem, 2*1Gbps
> >> >
> >> > # ceph -s
> >> >     cluster 0199f68d-a745-4da3-9670-15f2981e7a15
> >> >      health HEALTH_OK
> >> >      monmap e1: 3 mons at {node1=192.168.50.91:6789/0,node2=192.168.50.92:6789/0,node3=192.168.50.93:6789/0}
> >> >             election epoch 22, quorum 0,1,2 node1,node2,node3
> >> >      osdmap e200: 9 osds: 9 up, 9 in
> >> >             flags sortbitwise
> >> >       pgmap v1162: 448 pgs, 1 pools, 14337 MB data, 4935 objects
> >> >             18339 MB used, 5005 GB / 5023 GB avail
> >> >                  448 active+clean
> >> >   client io 87438 kB/s wr, 0 op/s rd, 213 op/s wr
> >> >
> >> > sudo ceph osd tree
> >> > ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >> > -1 4.90581 root default
> >> > -2 1.63527     host node1
> >> > 0 0.54509         osd.0       up  1.00000          1.00000
> >> > 1 0.54509         osd.1       up  1.00000          1.00000
> >> > 2 0.54509         osd.2       up  1.00000          1.00000
> >> > -3 1.63527     host node2
> >> > 3 0.54509         osd.3       up  1.00000          1.00000
> >> > 4 0.54509         osd.4       up  1.00000          1.00000
> >> > 5 0.54509         osd.5       up  1.00000          1.00000
> >> > -4 1.63527     host node3
> >> > 6 0.54509         osd.6       up  1.00000          1.00000
> >> > 7 0.54509         osd.7       up  1.00000          1.00000
> >> > 8 0.54509         osd.8       up  1.00000          1.00000
> >> >
> >> >
> >> >
> >> > A Linux VM in VMware, running fio: the 4k randwrite result is just 64
> >> > IOPS and latency is high; a dd test gives just 11MB/s.
> >> >
> >> > fio -ioengine=libaio -bs=4k -direct=1 -thread -rw=randwrite
> >> > -size=100G -filename=/dev/sdb -name="EBS 4KB randwrite test"
> >> > -iodepth=32 -runtime=60
> >> > EBS 4KB randwrite test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> >> > fio-2.0.13
> >> > Starting 1 thread
> >> > Jobs: 1 (f=1): [w] [100.0% done] [0K/131K/0K /s] [0 /32 /0  iops] [eta 00m:00s]
> >> > EBS 4KB randwrite test: (groupid=0, jobs=1): err= 0: pid=6766: Wed Jun 29 21:28:06 2016
> >> >   write: io=15696KB, bw=264627 B/s, iops=64 , runt= 60737msec
> >> >     slat (usec): min=10 , max=213 , avg=35.54, stdev=16.41
> >> >     clat (msec): min=1 , max=31368 , avg=495.01, stdev=1862.52
> >> >      lat (msec): min=2 , max=31368 , avg=495.04, stdev=1862.52
> >> >     clat percentiles (msec):
> >> >      |  1.00th=[    7],  5.00th=[    8], 10.00th=[    8], 20.00th=[    9],
> >> >      | 30.00th=[    9], 40.00th=[   10], 50.00th=[  198], 60.00th=[  204],
> >> >      | 70.00th=[  208], 80.00th=[  217], 90.00th=[  799], 95.00th=[ 1795],
> >> >      | 99.00th=[ 7177], 99.50th=[12649], 99.90th=[16712], 99.95th=[16712],
> >> >      | 99.99th=[16712]
> >> >     bw (KB/s)  : min=   36, max=11960, per=100.00%, avg=264.77, stdev=1110.81
> >> >     lat (msec) : 2=0.03%, 4=0.23%, 10=40.93%, 20=0.48%, 50=0.03%
> >> >     lat (msec) : 100=0.08%, 250=39.55%, 500=5.63%, 750=2.91%, 1000=1.35%
> >> >     lat (msec) : 2000=4.03%, >=2000=4.77%
> >> >   cpu          : usr=0.02%, sys=0.22%, ctx=2973, majf=0, minf=18446744073709538907
> >> >   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.4%, 32=99.2%, >=64=0.0%
> >> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> >> >      issued    : total=r=0/w=3924/d=0, short=r=0/w=0/d=0
> >> >
> >> > Run status group 0 (all jobs):
> >> >   WRITE: io=15696KB, aggrb=258KB/s, minb=258KB/s, maxb=258KB/s,
> >> > mint=60737msec, maxt=60737msec
> >> >
> >> > Disk stats (read/write):
> >> >   sdb: ios=83/3921, merge=0/0, ticks=60/1903085, in_queue=1931694, util=100.00%
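
As a sanity check on the fio output above: with a fixed queue depth, Little's law says average latency is roughly queue depth divided by IOPS. The sketch below plugs in the numbers from that run (3924 writes in 60737 msec at iodepth=32) and also divides the write ticks by the write count from the "Disk stats" line, which is essentially how a per-IO write wait is derived from block-layer counters. The function names are illustrative only.

```python
def littles_law_latency_ms(queue_depth: int, completed_ios: int, runtime_ms: float) -> float:
    """Average per-IO latency implied by Little's law (L = lambda * W)."""
    iops = completed_ios / (runtime_ms / 1000.0)
    return queue_depth / iops * 1000.0


def average_wait_ms(total_ticks_ms: float, completed_ios: int) -> float:
    """Mean time per IO from cumulative ticks, as in the fio 'Disk stats' line."""
    return total_ticks_ms / completed_ios


if __name__ == "__main__":
    # Values copied from the fio run quoted above.
    print(f"Little's law: {littles_law_latency_ms(32, 3924, 60737):.0f} ms")
    print(f"Disk stats:   {average_wait_ms(1903085, 3921):.0f} ms")
```

Both figures come out close to the reported clat average of 495.01 msec, which suggests the 64 IOPS result is simply the latency showing through: the queue is full for the whole run and each write waits about half a second.
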
> >> >
> >> > Can anyone give me some suggestions to improve the performance?
> >> >
> >> > Regards
> >> >
> >> > MQ
> >> >
> >> >
> >> >
> >> >
> >> > _______________________________________________
> >> > ceph-users mailing list
> >> > ceph-users@xxxxxxxxxxxxxx
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
