Re: Ceph RBD LIO ESXi Advice?

Hi Timofey,

You are most likely seeing the effects of Ceph's write latency combined with the sync write behaviour of ESXi. You will probably struggle to get much under 2ms write latency with Ceph, assuming a minimum of 2 copies, which will limit you to around 500 IOPS at a queue depth of 1. Because of this you will also see slow file/VM copies, as ESXi moves blocks of data around in 64KB sync IOs: 500 x 64KB = ~30MB/s.
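
To make the arithmetic explicit (the 2ms is a rough estimate, not a measured figure):

    per-write latency  ~ 2ms   (replicated write plus network round trips)
    QD1 write IOPS     ~ 1 / 0.002s  = 500
    copy throughput    ~ 500 x 64KB  = 32000KB/s = ~30MB/s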

Moving to 10GbE end to end may get you a reasonable boost in performance, as you will be removing 1ms or so of network latency from each write. Also search the mailing list for small performance tweaks you can do, like disabling logging.
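
For example, something along these lines in ceph.conf turns the per-daemon debug logging off (a sketch only; the exact list of debug options worth touching varies by version):

[global]
debug ms = 0/0
debug auth = 0/0
debug osd = 0/0
debug filestore = 0/0
debug journal = 0/0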

Other than that, the only thing I have found that has a chance of giving you performance similar to other products and/or legacy SANs is to use some sort of RBD caching with something like flashcache/enhanceio/bcache on your proxy nodes. However, this brings its own challenges and I still haven't got to a point where I'm happy to deploy it.
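
For reference, the bcache variant looks roughly like this on a proxy node (device names are placeholders, and I would treat the whole thing as experimental):

# local SSD as the cache device, the mapped RBD as the backing device
make-bcache -C /dev/sdX
make-bcache -B /dev/rbd0
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
echo writeback > /sys/block/bcache0/bcache/cache_mode
# then export /dev/bcache0 via LIO instead of /dev/rbd0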

I'm also surprised you are not seeing LIO hangs, which several people including me have experienced when using RBD+LIO+ESXi, although I haven't checked recently to see whether this now behaves better. I would be interested in hearing your feedback on this. The hangs normally manifest when an OSD drops out and IO is suspended for more than 5-10s.

Sorry I couldn't be of more help.

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Timofey Titovets
> Sent: 07 November 2015 11:44
> To: ceph-users@xxxxxxxxxxxxxx
> Subject:  Ceph RBD LIO ESXi Advice?
> 
> Hi List,
> I'm searching for advice from somebody who uses a legacy client like ESXi
> with Ceph.
> 
> I'm trying to build high-performance, fault-tolerant storage with Ceph 0.94.
> 
> In production I have 50+ TB of VMs (~800 VMs) on 8 NFS servers, each with:
> 2x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
> 12x Seagate ST2000NM0023
> 1x LSI Nytro™ MegaRAID® NMR 8110-4i
> 96 GB of RAM
> 4x 1GbE links in balance-alb mode (I don't have a problem with network
> throughput)
> 
> Now in the lab I have built a 3-node cluster:
> Kernel 4.2
> Intel(R) Xeon(R) CPU 5130 @ 2.00GHz
> 16 GB of RAM
> 6x Seagate ST2000NM0033
> 2x 1GbE in balance-alb
> i.e. each node runs a MON and 6 OSDs
> 
> 
> Config like:
> osd journal size = 16384          # journal size in MB, i.e. 16GB
> osd pool default size = 2         # 2 replicas
> osd pool default min size = 2
> osd pool default pg num = 256
> osd pool default pgp num = 256
> osd crush chooseleaf type = 1     # host-level failure domain
> filestore max sync interval = 180 # seconds
> 
> To attach the RBD storage to ESXi I created 2 VMs:
> 2 cores
> 2 GB RAM
> Kernel 4.3
> Each VM maps the big RBD volume and proxies it via LIO to ESXi. ESXi sees
> the VMs as an iSCSI target server in Active/Passive mode.
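> 
> (For illustration, the LIO export per proxy VM is roughly the following,
> with placeholder names and ACLs/portals omitted:
> targetcli /backstores/block create name=rbd0 dev=/dev/rbd0
> targetcli /iscsi create iqn.2015-11.lab.example:rbd0
> targetcli /iscsi/iqn.2015-11.lab.example:rbd0/tpg1/luns create /backstores/block/rbd0
> )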
> 
> The RBD was created with the --image-shared and --image-format 2 options.
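> 
> (For illustration, that would be something like:
> rbd create rbd/esxi-lun0 --size 2097152 --image-format 2 --image-shared
> where the pool/image names and size are placeholders.)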
> 
> My questions:
> 1. Do I have an architecture problem?
> 2. Maybe you have ideas?
> 3. ESXi works with the iSCSI storage very slowly (30-60 MB/s read/write),
> but this could be an ESXi problem; later I will test this with a more
> modern hypervisor server.
> 4. The proxy VMs work not too badly with the storage, but fio shows numbers
> that are too low:
> [global]
> size=128g       # File size
> filename=/storage/testfile.fio
> numjobs=1       # One thread
> runtime=600     # 10m for each test
> ioengine=libaio # Use async IO
> buffer_compress_percentage=15 # Pseudo-random data, compressible by ~15%
> overwrite=1     # Overwrite data in the file
> end_fsync=1     # fsync at the end of the test to flush OS buffers
> direct=1        # Bypass the OS cache
> startdelay=30   # Pause between tests
> bs=4k           # Block size for IO requests
> iodepth=64      # Number of async IO requests in flight
> rw=randrw       # Random read/write
> ####################################################
> # IOMeter defines the server loads as the following:
> # iodepth=1   # Linear
> # iodepth=4   # Very Light
> # iodepth=8   # Light
> # iodepth=64  # Moderate
> # iodepth=256 # Heavy
> ####################################################
> [Disk-4k-randomrw-depth-1]
> rwmixread=50
> iodepth=1
> stonewall # Do each test separated
> ####################################################
> [Disk-4k-randomrw-depth-8]
> rwmixread=50
> iodepth=8
> stonewall
> ####################################################
> [Disk-4k-randomrw-depth-64]
> rwmixread=50
> iodepth=64
> stonewall
> ####################################################
> [Disk-4k-randomrw-depth-256]
> rwmixread=50
> iodepth=256
> stonewall
> ####################################################
> [Disk-4k-randomrw-depth-512]
> rwmixread=50
> iodepth=512
> stonewall
> ####################################################
> [Disk-4k-randomrw-depth-1024]
> rwmixread=50
> iodepth=1024
> stonewall
> -- cut --
> 
> RBD-LIO-PROXY:
> -- cut --
> Disk-4k-randomrw-depth-512: (groupid=4, jobs=1): err= 0: pid=10601:
> Sat Nov  7 13:59:49 2015
>   read : io=770772KB, bw=1282.1KB/s, iops=320, runt=600813msec
>     clat (msec): min=141, max=8456, avg=715.87, stdev=748.55
>   write: io=769400KB, bw=1280.7KB/s, iops=320, runt=600813msec
>     clat (msec): min=158, max=9862, avg=878.73, stdev=905.47
> -- cut --
> One node with its disks in RAID0:
> Disk-4k-randomrw-depth-512: (groupid=4, jobs=1): err= 0: pid=4652: Fri Oct
> 30 16:29:00 2015
>   read : io=258500KB, bw=2128.4KB/s, iops=532, runt=121455msec
>     clat (msec): min=1, max=3983, avg=484.80, stdev=478.39
>   write: io=257568KB, bw=2120.8KB/s, iops=530, runt=121455msec
>     clat (usec): min=217, max=3976.1K, avg=478327.33, stdev=480695.05
> -- cut --
> 
> From my experience with ScaleIO, I should be getting numbers like ~1000
> IOPS on the proxy node.
> 
> I can provide the full fio config and logs if needed; I'm just trying to
> fix a performance problem and looking for advice.
> 
> 5. Maybe I should change my fio config?
> 6. Maybe I'm missing something?
> 
> If someone has experience with similar solutions, stories and links are
> welcome -.-
> 
> --
> Have a nice day,
> Timofey.
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



