Re: Ceph RBD LIO ESXi Advice?

Alex, are you using ESXi?
If yes, are you using the iSCSI Software Adapter?
If yes, are you using active/passive, Fixed, or Round Robin MPIO?
Did you tune anything on the initiator side?

If possible, can you give more details? Please.
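
For reference, on the ESXi side I mean the path selection policy settings,
e.g. something like this (the naa.XXXX device ID is only a placeholder):

esxcli storage nmp device list
esxcli storage nmp device set --device naa.XXXX --psp VMW_PSP_RR
# optionally switch paths every IO instead of every 1000 IOs:
esxcli storage nmp psp roundrobin deviceconfig set --device naa.XXXX --type iops --iops 1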

2015-11-09 17:41 GMT+03:00 Timofey Titovets <nefelim4ag@xxxxxxxxx>:
> Many thanks, Alex, you've given me hope. I'll try SCST later in the
> configuration you suggest.
>
> 2015-11-09 16:25 GMT+03:00 Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>:
>> Hi Timofey,
>>
>> With Nick's, Jan's, Red Hat's and others' help we have a stable and, in my
>> best judgement, well-performing system using SCST as the iSCSI delivery
>> framework.  SCST allows the use of the Linux page cache when utilizing the
>> vdisk_fileio backend.  LIO should be able to do this too, using the FILEIO
>> backstore with the block device name as the file name, but I have not tried
>> that, having switched to SCST for stability.
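>>
>> As a rough sketch (device name, IQN and RBD path are placeholders, not our
>> actual config), an /etc/scst.conf along these lines exports a mapped RBD
>> through vdisk_fileio:
>>
>> HANDLER vdisk_fileio {
>>         DEVICE disk01 {
>>                 filename /dev/rbd0   # mapped RBD device, served via the page cache
>>         }
>> }
>>
>> TARGET_DRIVER iscsi {
>>         enabled 1
>>         TARGET iqn.2015-11.example.lab:rbd0 {
>>                 enabled 1
>>                 LUN 0 disk01
>>         }
>> }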
>>
>> The page cache will improve latency due to the reads and writes first
>> occurring in RAM.  Naturally, all the usual considerations apply as to the
>> loss of dirty pages on machine crash.  So tuning the vm.dirty* parameters is
>> quite important.
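>>
>> For example, something in this spirit as a starting point (values are purely
>> illustrative, not a recommendation; they depend on RAM size and on how much
>> unflushed data you can afford to lose):
>>
>> sysctl vm.dirty_background_ratio=5      # start background writeback earlier
>> sysctl vm.dirty_ratio=10                # cap dirty pages before writers block
>> sysctl vm.dirty_expire_centisecs=1500   # flush dirty data older than 15s
>> sysctl vm.dirty_writeback_centisecs=500 # wake the flusher thread every 5s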
>>
>> This setting was critically important to avoid hangs and major issues due to
>> some problem with XFS and page cache on OSD nodes:
>>
>> sysctl vm.min_free_kbytes=1048576
>>
>> (reserved memory when using vm.swappiness = 1)
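>>
>> To keep both settings across reboots, one option (the file name is
>> arbitrary) is:
>>
>> cat > /etc/sysctl.d/90-osd-vm.conf <<EOF
>> vm.min_free_kbytes = 1048576
>> vm.swappiness = 1
>> EOF
>> sysctl --system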
>>
>> 10 GbE networking seems to be helping a lot, though it could just be the
>> superior response of a higher-end switch.
>>
>> Using the blk-mq scheduler has also been reported to improve performance on
>> random IO.
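>>
>> If you want to experiment with that, on kernels of this era blk-mq for SCSI
>> devices is typically switched on with a boot parameter (sdX below is a
>> placeholder):
>>
>> # add to the kernel command line (e.g. GRUB_CMDLINE_LINUX), then reboot:
>> scsi_mod.use_blk_mq=1
>> # a device is using blk-mq if it has an mq directory in sysfs:
>> ls /sys/block/sdX/mq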
>>
>> Good luck!
>>
>> --
>> Alex Gorbachev
>> Storcium
>>
>> On Sun, Nov 8, 2015 at 5:07 PM, Timofey Titovets <nefelim4ag@xxxxxxxxx>
>> wrote:
>>>
>>> Big thanks anyway, Nick.
>>> I am now hitting hangs of both ESXi and the proxy =_=''
>>> /* Proxy VM: Ubuntu 15.10 / Kernel 4.3 / LIO / Ceph 0.94 / ESXi 6.0
>>> Software iSCSI */
>>> I've moved to an NFS-RBD proxy and am now trying to make it HA.
>>>
>>> 2015-11-07 18:59 GMT+03:00 Nick Fisk <nick@xxxxxxxxxx>:
>>> > Hi Timofey,
>>> >
>>> > You are most likely experiencing the effects of Ceph's write latency in
>>> > combination with the sync write behaviour of ESXi. You will probably
>>> > struggle to get much under 2ms write latency with Ceph, assuming a minimum
>>> > of 2 copies. This will limit you to around 500 IOPS at a queue depth (QD)
>>> > of 1. Because of this you will also experience slow file/VM copies, as ESXi
>>> > moves the blocks of data around in 64kB sync IOs: 500 x 64kB = ~30MB/s.
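>>> >
>>> > As a quick back-of-the-envelope check of those numbers:
>>> >
>>> > # QD=1 IOPS ~ 1 / per-write latency; throughput ~ IOPS x IO size
>>> > echo '1/0.002; (1/0.002)*64/1024' | bc -l
>>> > # -> 500 (IOPS) and 31.25 (MB/s)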
>>> >
>>> > Moving to 10GbE end to end may get you a reasonable boost in performance,
>>> > as you will be removing 1ms or so of latency from the network for each
>>> > write. Also search the mailing list for small performance tweaks you can
>>> > do, like disabling logging.
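>>> >
>>> > By disabling logging I mean the usual ceph.conf debug settings, roughly
>>> > along these lines (not an exhaustive or authoritative list):
>>> >
>>> > [global]
>>> >     debug ms = 0/0
>>> >     debug osd = 0/0
>>> >     debug filestore = 0/0
>>> >     debug journal = 0/0
>>> >     debug auth = 0/0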
>>> >
>>> > Other than that, the only thing I have found that has a chance of giving
>>> > you performance similar to other products and/or legacy SANs is to use some
>>> > sort of RBD caching with something like flashcache/enhanceio/bcache on your
>>> > proxy nodes. However this brings its own challenges and I still haven't got
>>> > to a point where I'm happy to deploy it.
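>>> >
>>> > For the bcache variant, the rough shape on a proxy node would be something
>>> > like this (device names are placeholders, and writeback mode is exactly
>>> > where the challenges come from):
>>> >
>>> > make-bcache -C /dev/sdb      # local SSD becomes the cache set
>>> > make-bcache -B /dev/rbd0     # mapped RBD becomes the backing device
>>> > # attach the backing device to the cache set (UUID from bcache-super-show /dev/sdb):
>>> > echo CACHE_SET_UUID > /sys/block/bcache0/bcache/attach
>>> > echo writeback > /sys/block/bcache0/bcache/cache_mode
>>> > # then export /dev/bcache0 instead of /dev/rbd0 from the iSCSI target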
>>> >
>>> > I'm surprised you are also not seeing LIO hangs, which several people,
>>> > including me, experience when using RBD+LIO+ESXi, although I haven't checked
>>> > recently to see whether this is now working better. I would be interested in
>>> > hearing your feedback on this. The hangs normally manifest themselves when an
>>> > OSD drops out and IO is suspended for more than 5-10s.
>>> >
>>> > Sorry I couldn't be of more help.
>>> >
>>> > Nick
>>> >
>>> >> -----Original Message-----
>>> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
>>> >> Of
>>> >> Timofey Titovets
>>> >> Sent: 07 November 2015 11:44
>>> >> To: ceph-users@xxxxxxxxxxxxxx
>>> >> Subject:  Ceph RBD LIO ESXi Advice?
>>> >>
>>> >> Hi List,
>>> >> I am searching for advice from somebody who uses a legacy client like
>>> >> ESXi with Ceph.
>>> >>
>>> >> I am trying to build high-performance, fault-tolerant storage with Ceph 0.94.
>>> >>
>>> >> In production I have 50+ TB of VMs (~800 VMs) on 8 NFS servers, each with:
>>> >> 2x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
>>> >> 12x Seagate ST2000NM0023
>>> >> 1x LSI Nytro™ MegaRAID® NMR 8110-4i
>>> >> 96 GB of RAM
>>> >> 4x 1 GbE links in balance-alb mode (I don't have a problem with network
>>> >> throughput)
>>> >>
>>> >> Now in the lab I have built a 3-node cluster:
>>> >> Kernel 4.2
>>> >> Intel(R) Xeon(R) CPU 5130 @ 2.00GHz
>>> >> 16 GB of RAM
>>> >> 6x Seagate ST2000NM0033
>>> >> 2x 1 GbE in balance-alb
>>> >> i.e. each node runs a MON and 6 OSDs
>>> >>
>>> >>
>>> >> The config looks like this:
>>> >> osd journal size = 16384
>>> >> osd pool default size = 2
>>> >> osd pool default min size = 2
>>> >> osd pool default pg num = 256
>>> >> osd pool default pgp num = 256
>>> >> osd crush chooseleaf type = 1
>>> >> filestore max sync interval = 180
>>> >>
>>> >> To attach the RBD storage to ESXi I created 2 VMs:
>>> >> 2 cores
>>> >> 2 GB RAM
>>> >> Kernel 4.3
>>> >> Each VM maps the big RBD volume and proxies it via LIO to ESXi; ESXi sees
>>> >> the VMs as an iSCSI target server in Active/Passive mode.
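>>> >>
>>> >> Roughly, the export on each proxy VM looks like this (IQNs and names are
>>> >> placeholders; this is only a sketch of the setup, not the exact commands):
>>> >>
>>> >> targetcli /backstores/block create name=rbd0 dev=/dev/rbd0
>>> >> targetcli /iscsi create iqn.2015-11.lab.example:rbd-proxy1
>>> >> targetcli /iscsi/iqn.2015-11.lab.example:rbd-proxy1/tpg1/luns create /backstores/block/rbd0
>>> >> targetcli /iscsi/iqn.2015-11.lab.example:rbd-proxy1/tpg1/acls create iqn.1998-01.com.vmware:esxi-host1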
>>> >>
>>> >> The RBD image is created with the --image-shared and --image-format 2 options,
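>>> >> i.e. something like this (pool/image name and size are placeholders):
>>> >>
>>> >> rbd create --image-format 2 --image-shared --size 20480000 rbd/esxi-lun0
>>> >> rbd map rbd/esxi-lun0   # on each proxy VM; the image appears as /dev/rbdX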
>>> >>
>>> >> My questions:
>>> >> 1. Do I have an architecture problem?
>>> >> 2. Maybe you have ideas?
>>> >> 3. ESXi works with the iSCSI storage very slowly (30-60 MB/s read/write),
>>> >> but this could be an ESXi problem; later I will test this with a more
>>> >> modern hypervisor server.
>>> >> 4. The proxy VMs work not too badly with the storage, but fio shows
>>> >> numbers that are too low:
>>> >> [global]
>>> >> size=128g   # File size
>>> >> filename=/storage/testfile.fio
>>> >> numjobs=1   # One thread
>>> >> runtime=600 # 10m for each test
>>> >> ioengine=libaio # Use async io
>>> >>         # Pseudo-random data, compressible by ~15%
>>> >> buffer_compress_percentage=15
>>> >> overwrite=1 # Overwrite data in file
>>> >> end_fsync=1 # Do an fsync at the end of the test to flush OS buffers
>>> >> direct=1    # Bypass OS cache
>>> >> startdelay=30   # Pause between tests
>>> >> bs=4k       # Block size for io requests
>>> >> iodepth=64  # Number of IO requests that can be outstanding asynchronously
>>> >> rw=randrw   # Random Read/Write
>>> >> ####################################################
>>> >> # IOMeter defines the server loads as the following:
>>> >> # iodepth=1   # Linear
>>> >> # iodepth=4   # Very Light
>>> >> # iodepth=8   # Light
>>> >> # iodepth=64  # Moderate
>>> >> # iodepth=256 # Heavy
>>> >> ####################################################
>>> >> [Disk-4k-randomrw-depth-1]
>>> >> rwmixread=50
>>> >> iodepth=1
>>> >> stonewall # Run each test separately
>>> >> ####################################################
>>> >> [Disk-4k-randomrw-depth-8]
>>> >> rwmixread=50
>>> >> iodepth=8
>>> >> stonewall
>>> >> ####################################################
>>> >> [Disk-4k-randomrw-depth-64]
>>> >> rwmixread=50
>>> >> stonewall
>>> >> ####################################################
>>> >> [Disk-4k-randomrw-depth-256]
>>> >> rwmixread=50
>>> >> iodepth=256
>>> >> stonewall
>>> >> ####################################################
>>> >> [Disk-4k-randomrw-depth-512]
>>> >> rwmixread=50
>>> >> iodepth=512
>>> >> stonewall
>>> >> ####################################################
>>> >> [Disk-4k-randomrw-depth-1024]
>>> >> rwmixread=50
>>> >> iodepth=1024
>>> >> stonewall
>>> >> -- cut --
>>> >>
>>> >> RBD-LIO-PROXY:
>>> >> -- cut --
>>> >> Disk-4k-randomrw-depth-512: (groupid=4, jobs=1): err= 0: pid=10601:
>>> >> Sat Nov  7 13:59:49 2015
>>> >>   read : io=770772KB, bw=1282.1KB/s, iops=320, runt=600813msec
>>> >>     clat (msec): min=141, max=8456, avg=715.87, stdev=748.55
>>> >>   write: io=769400KB, bw=1280.7KB/s, iops=320, runt=600813msec
>>> >>     clat (msec): min=158, max=9862, avg=878.73, stdev=905.47
>>> >> -- cut --
>>> >> The same test on one of the nodes in RAID 0:
>>> >> Disk-4k-randomrw-depth-512: (groupid=4, jobs=1): err= 0: pid=4652: Fri
>>> >> Oct
>>> >> 30 16:29:00 2015
>>> >>   read : io=258500KB, bw=2128.4KB/s, iops=532, runt=121455msec
>>> >>     clat (msec): min=1, max=3983, avg=484.80, stdev=478.39
>>> >>   write: io=257568KB, bw=2120.8KB/s, iops=530, runt=121455msec
>>> >>     clat (usec): min=217, max=3976.1K, avg=478327.33, stdev=480695.05
>>> >> -- cut --
>>> >>
>>> >> From my experience with ScaleIO, I should be getting numbers like ~1000
>>> >> IOPS on the proxy node.
>>> >>
>>> >> I can provide the full fio config and logs if needed; I am just trying to
>>> >> fix the performance problem and searching for advice.
>>> >>
>>> >> 5. Maybe I need to change my fio config?
>>> >> 6. Maybe I am missing something?
>>> >>
>>> >> If someone has experience with similar solutions, stories and links are
>>> >> welcome -.-
>>> >>
>>> >> --
>>> >> Have a nice day,
>>> >> Timofey.
>>> >> _______________________________________________
>>> >> ceph-users mailing list
>>> >> ceph-users@xxxxxxxxxxxxxx
>>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Have a nice day,
>>> Timofey.
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
>
> --
> Have a nice day,
> Timofey.



-- 
Have a nice day,
Timofey.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



