Re: Ceph RBD LIO ESXi Advice?

Hi Timofey,

With Nick's, Jan's, Red Hat's and others' help we have a stable and, in my best judgement, well-performing system using SCST as the iSCSI delivery framework.  SCST allows the use of the Linux page cache when utilizing the vdisk_fileio backend.  LIO should be able to do this too, using the FILEIO backstore with the block device name as the file name, but I have not tried that, having switched to SCST for stability.
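
For anyone who wants a concrete starting point, a minimal /etc/scst.conf fragment along these lines might look roughly like the sketch below.  This is illustrative only, not our production config; the device name, backing RBD device and IQN are made up:

HANDLER vdisk_fileio {
        DEVICE disk0 {
                filename /dev/rbd0
        }
}

TARGET_DRIVER iscsi {
        enabled 1
        TARGET iqn.2015-11.com.example:rbd0 {
                enabled 1
                LUN 0 disk0
        }
}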

The page cache improves latency because reads and writes first hit RAM.  Naturally, all the usual considerations apply regarding the loss of dirty pages on a machine crash, so tuning the vm.dirty_* parameters is quite important.
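
Purely as an illustration (the right values depend on how much RAM you have and how much unflushed data you can afford to lose), the kind of tuning I mean looks like:

sysctl vm.dirty_background_ratio=5
sysctl vm.dirty_ratio=10
sysctl vm.dirty_expire_centisecs=1500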

This setting was critically important to avoid hangs and major issues due to some problem with XFS and page cache on OSD nodes:

sysctl vm.min_free_kbytes=1048576

(this reserves memory; we also run with vm.swappiness = 1)
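
To make these settings survive a reboot, the usual approach is a sysctl.d drop-in, for example (the file name is arbitrary):

cat <<'EOF' > /etc/sysctl.d/90-ceph-osd.conf
vm.min_free_kbytes = 1048576
vm.swappiness = 1
EOF
sysctl --system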

10 GbE networking seems to be helping a lot, though it could just be the superior response of a higher-end switch.

Using the blk-mq block layer has also been reported to improve random IO performance; a sketch of enabling it is below.
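
On kernels of this vintage, blk-mq for SCSI devices is switched on with a boot parameter.  A rough example for an Ubuntu/GRUB system (keep your existing cmdline options in place of the "..."):

# /etc/default/grub
GRUB_CMDLINE_LINUX="... scsi_mod.use_blk_mq=1"
# then: update-grub && reboot
# a device running under blk-mq reports "none" as its scheduler:
cat /sys/block/sda/queue/scheduler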

Good luck!

--
Alex Gorbachev
Storcium

On Sun, Nov 8, 2015 at 5:07 PM, Timofey Titovets <nefelim4ag@xxxxxxxxx> wrote:
Big thanks Nick, anyway.
Now I'm catching hangs of ESXi and the proxy =_=''
/* Proxy VM: Ubuntu 15.10 / Kernel 4.3 / LIO / Ceph 0.94 / ESXi 6.0 Software iSCSI */
I've moved to an NFS-RBD proxy and am now trying to make it HA.
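
In rough outline (device, mount point and export entry are only examples), that proxy amounts to mapping the RBD, putting a filesystem on it and exporting it over NFS:

rbd map rbd/nfsvol
mkfs.xfs /dev/rbd0               # only on first use
mount /dev/rbd0 /export/nfsvol
# /etc/exports (example client spec)
/export/nfsvol  esxi-host(rw,sync,no_root_squash)
exportfs -ra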

2015-11-07 18:59 GMT+03:00 Nick Fisk <nick@xxxxxxxxxx>:
> Hi Timofey,
>
> You are most likely experiencing the effects of Ceph's write latency in combination with the sync write behaviour of ESXi. You will probably struggle to get much under 2ms write latency with Ceph, assuming a minimum of 2 copies in Ceph. This will limit you to around 500 IOPS at a queue depth of 1 (1 s / 2 ms = 500). Because of this you will also experience slow file/VM copies, as ESXi moves the blocks of data around in 64 KB sync IOs: 500 x 64 KB = ~30 MB/s.
>
> Moving to 10GbE end to end may get you a reasonable boost in performance, as you will be removing 1 ms or so of network latency from each write. Also search the mailing list for small performance tweaks you can do, like disabling logging (a rough example is sketched below).
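>
> Purely as an illustration of what "disabling logging" means here (the exact set of debug subsystems people turn down varies), something like this in ceph.conf on the OSD nodes:
>
> [global]
> debug ms = 0/0
> debug osd = 0/0
> debug filestore = 0/0
> debug journal = 0/0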
>
> Other than that, the only thing I have found that has a chance of giving you performance similar to other products and/or legacy SANs is to use some sort of RBD caching with something like flashcache/enhanceio/bcache on your proxy nodes. However, this brings its own challenges and I still haven't got to the point where I'm happy to deploy it.
>
> I'm surprised you are also not seeing LIO hangs, which several people including me have experienced when using RBD+LIO+ESXi, although I haven't checked recently to see if this is now working better. I would be interested in hearing your feedback on this. They normally manifest themselves when an OSD drops out and IO is suspended for more than 5-10s.
>
> Sorry I couldn't be of more help.
>
> Nick
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Timofey Titovets
>> Sent: 07 November 2015 11:44
>> To: ceph-users@xxxxxxxxxxxxxx
>> Subject: Ceph RBD LIO ESXi Advice?
>>
>> Hi List,
>> I am searching for advice from somebody who uses a legacy client like ESXi with
>> Ceph.
>>
>> I am trying to build high-performance, fault-tolerant storage with Ceph 0.94.
>>
>> In production I have 50+ TB of VMs (~800 VMs) on
>> 8 NFS servers, each with:
>> 2x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
>> 12x Seagate ST2000NM0023
>> 1x LSI Nytro™ MegaRAID® NMR 8110-4i
>> 96 GB of RAM
>> 4x 1 GbE links in balance-alb mode (I don't have a problem with network
>> throughput)
>>
>> Now in the lab I have built a 3-node cluster like this:
>> Kernel 4.2
>> Intel(R) Xeon(R) CPU 5130 @ 2.00GHz
>> 16 GB of RAM
>> 6x Seagate ST2000NM0033
>> 2x 1 GbE in balance-alb
>> i.e. each node is a MON and has 6 OSDs
>>
>>
>> The config looks like this:
>> osd journal size = 16384
>> osd pool default size = 2
>> osd pool default min size = 2
>> osd pool default pg num = 256
>> osd pool default pgp num = 256
>> osd crush chooseleaf type = 1
>> filestore max sync interval = 180
>>
>> To attach the RBD storage to ESXi I created 2 VMs:
>> 2 cores
>> 2 GB RAM
>> Kernel 4.3
>> Each VM maps a big RBD volume and proxies it via LIO to ESXi; ESXi sees the VMs as
>> iSCSI target servers in Active/Passive mode.
>>
>> The RBD was created with the --image-shared and --image-format 2 options.
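>>
>> As a sketch, creating and mapping such a shared format-2 image looks roughly like
>> this (the pool/image name and size are examples only):
>>
>> # --image-shared marks the image as used concurrently (disables features that
>> # need exclusive access), so both proxy VMs can map it
>> rbd create --size 102400 --image-format 2 --image-shared rbd/esxi-lun0
>> # map it through the kernel RBD client on each proxy VM
>> rbd map rbd/esxi-lun0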
>>
>> My Questions:
>> 1. Do I have an architecture problem?
>> 2. Maybe you have ideas?
>> 3. ESXi works with the iSCSI storage very slowly (30-60 MB/s read/write), but this
>> could be an ESXi problem; I will test this later with a more modern hypervisor
>> server.
>> 4. The proxy VMs themselves work with the storage not too badly, but fio shows
>> numbers that are too low:
>> [global]
>> size=128g   # File size
>> filename=/storage/testfile.fio
>> numjobs=1   # One thread
>> runtime=600 # 10m for each test
>> ioengine=libaio # Use async io
>>         # Pseudo-random data, can be compressed by 15%
>> buffer_compress_percentage=15
>> overwrite=1 # Overwrite data in file
>> end_fsync=1 # Do an fsync at the end of the test to flush OS buffers
>> direct=1    # Bypass OS cache
>> startdelay=30   # Pause between tests
>> bs=4k       # Block size for io requests
>> iodepth=64  # Number of IO requests that can be in flight asynchronously
>> rw=randrw   # Random Read/Write
>> ####################################################
>> # IOMeter defines the server loads as the following:
>> # iodepth=1   # Linear
>> # iodepth=4   # Very Light
>> # iodepth=8   # Light
>> # iodepth=64  # Moderate
>> # iodepth=256 # Heavy
>> ####################################################
>> [Disk-4k-randomrw-depth-1]
>> rwmixread=50
>> iodepth=1
>> stonewall # Run each test separately
>> ####################################################
>> [Disk-4k-randomrw-depth-8]
>> rwmixread=50
>> iodepth=8
>> stonewall
>> ####################################################
>> [Disk-4k-randomrw-depth-64]
>> rwmixread=50
>> iodepth=64
>> stonewall
>> ####################################################
>> [Disk-4k-randomrw-depth-256]
>> rwmixread=50
>> iodepth=256
>> stonewall
>> ####################################################
>> [Disk-4k-randomrw-depth-512]
>> rwmixread=50
>> iodepth=512
>> stonewall
>> ####################################################
>> [Disk-4k-randomrw-depth-1024]
>> rwmixread=50
>> iodepth=1024
>> stonewall
>> -- cut --
>>
>> RBD-LIO-PROXY:
>> -- cut --
>> Disk-4k-randomrw-depth-512: (groupid=4, jobs=1): err= 0: pid=10601:
>> Sat Nov  7 13:59:49 2015
>>   read : io=770772KB, bw=1282.1KB/s, iops=320, runt=600813msec
>>     clat (msec): min=141, max=8456, avg=715.87, stdev=748.55
>>   write: io=769400KB, bw=1280.7KB/s, iops=320, runt=600813msec
>>     clat (msec): min=158, max=9862, avg=878.73, stdev=905.47
>> -- cut --
>> One of the nodes in RAID0:
>> Disk-4k-randomrw-depth-512: (groupid=4, jobs=1): err= 0: pid=4652: Fri Oct
>> 30 16:29:00 2015
>>   read : io=258500KB, bw=2128.4KB/s, iops=532, runt=121455msec
>>     clat (msec): min=1, max=3983, avg=484.80, stdev=478.39
>>   write: io=257568KB, bw=2120.8KB/s, iops=530, runt=121455msec
>>     clat (usec): min=217, max=3976.1K, avg=478327.33, stdev=480695.05
>> -- cut --
>>
>> From my experience with ScaleIO, I should be getting numbers like ~1000 IOPS
>> on the proxy node.
>>
>> I can provide the full fio config and logs if needed; I am just trying to fix a
>> performance problem and am searching for advice.
>>
>> 5. Maybe I must change my fio config?
>> 6. Maybe I am missing something?
>>
>> If someone has experience with similar solutions, stories and links are
>> welcome -.-
>>
>> --
>> Have a nice day,
>> Timofey.
>
>
>
>



--
Have a nice day,
Timofey.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
