> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of nick
> Sent: 18 August 2016 12:39
> To: nick@xxxxxxxxxx
> Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Ceph all NVME Cluster sequential read speed
>
> So after disabling logging and setting intel_idle.max_cstate=1 we reach 1953 IOPS for 4k blocksizes (with an iodepth of 1) instead of
> 1382. This is an increase of 41%. Very cool.
>
> Furthermore I played a bit with striping in RBD images. When choosing a 1MB stripe unit and a stripe count of 4 there is a huge
> difference when benchmarking with bigger block sizes (with 4MB blocksize I get twice the speed). Benchmarking this with 4k
> blocksizes I can see almost no difference to the default images (stripe-unit=4M and stripe-count=1).
>
> Did anyone else play with different stripe units in the images? I guess the stripe unit depends on the expected work pattern in the
> virtual machine.

RBD is already striped in object-sized chunks; the difference from RAID stripes is the size of the chunks/objects involved. A RAID array might use 64KB chunks, which means that even a small readahead will likely cause a read across all chunks of the stripe, giving very good performance. In Ceph the chunks are 4MB, so if you want to read across multiple objects you need a readahead larger than 4MB.

Image-level striping is more about lowering contention on a single PG than about improving sequential performance. For example, you might have a couple of MB of data that is being hit by thousands of IO requests. By using striping you can try to spread these requests over more PGs, which helps because there is a point in the data path of a PG that is effectively single-threaded.

If you want to improve sequential reads, use buffered IO and a large readahead (>16M).
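To make the object mapping concrete, here is a rough Python sketch of how an image offset maps to an object index under striping, and how many objects a single read touches. It follows the usual Ceph striping arithmetic as I understand it; the function names are made up for illustration, and the 4MB object size for the striped image is an assumption (only the stripe unit and stripe count were given above).

```python
MB = 1024 * 1024


def object_for_offset(offset, stripe_unit, stripe_count, object_size):
    """Return the index of the object that holds byte `offset` of the image."""
    stripes_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit        # stripe unit index over the whole image
    stripe_no = block_no // stripe_count    # row of stripe units
    stripe_pos = block_no % stripe_count    # column within that row
    object_set_no = stripe_no // stripes_per_object
    return object_set_no * stripe_count + stripe_pos


def objects_touched(offset, length, stripe_unit, stripe_count, object_size):
    """Return the set of object indices covered by one read."""
    touched = set()
    pos = offset
    end = offset + length
    while pos < end:
        touched.add(object_for_offset(pos, stripe_unit, stripe_count, object_size))
        # Jump to the start of the next stripe unit.
        pos = (pos // stripe_unit + 1) * stripe_unit
    return touched


# Default image layout: stripe-unit=4M, stripe-count=1, 4MB objects.
default_4m_read = objects_touched(0, 4 * MB, 4 * MB, 1, 4 * MB)
# Striped image: stripe-unit=1M, stripe-count=4, object size assumed to be 4MB.
striped_4m_read = objects_touched(0, 4 * MB, 1 * MB, 4, 4 * MB)
# A 4k read lands in one object either way.
striped_4k_read = objects_touched(0, 4096, 1 * MB, 4, 4 * MB)
# A 16MB readahead on the default layout spans several objects.
default_16m_readahead = objects_touched(0, 16 * MB, 4 * MB, 1, 4 * MB)

print(len(default_4m_read), len(striped_4m_read),
      len(striped_4k_read), len(default_16m_readahead))
```

Under these assumptions a 4MB read hits one object on the default layout and four on the striped one, while a 4k read hits a single object in both cases. That matches the benchmark result described above: a big difference at 4MB, almost none at 4k.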
>
> Cheers
> Nick
>
> On Thursday, August 18, 2016 10:23:34 AM Nick Fisk wrote:
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of wido@xxxxxxxx
> > > Sent: 18 August 2016 09:35
> > > To: nick <nick@xxxxxxx>
> > > Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > > Subject: Re: Ceph all NVME Cluster sequential read speed
> > >
> > > > On 18 Aug 2016, at 10:15, nick <nick@xxxxxxx> wrote:
> > > >
> > > > Hi,
> > > > we are currently building a new ceph cluster with only NVME devices.
> > > > One Node consists of 4x Intel P3600 2TB devices. Journal and
> > > > filestore are on the same device. Each server has a 10 core CPU
> > > > and uses 10 GBit ethernet NICs for public and ceph storage
> > > > traffic. We are currently testing with 4 nodes overall.
> > > >
> > > > The cluster will be used only for virtual machine images via RBD.
> > > > The pools are replicated (no EC).
> > > >
> > > > Although we are pretty happy with the single threaded write
> > > > performance, the single threaded (iodepth=1) sequential read
> > > > performance is a bit disappointing.
> > > >
> > > > We are testing with fio and the rbd engine. After creating a 10GB
> > > > RBD image, we use the following fio params to test:
> > > > """
> > > > [global]
> > > > invalidate=1
> > > > ioengine=rbd
> > > > iodepth=1
> > > > ramp_time=2
> > > > size=2G
> > > > bs=4k
> > > > direct=1
> > > > buffered=0
> > > > """
> > > >
> > > > For a 4k workload we are reaching 1382 IOPS. Testing one NVME
> > > > device directly (with psync engine and iodepth of 1) we can reach
> > > > up to 84176 IOPS. This is a big difference.
> > >
> > > Network is a big difference as well. Keep in mind the Ceph OSDs have
> > > to process the I/O as well.
> > >
> > > For example, if you have a network latency of 0.200ms, in 1,000ms (1
> > > sec) you will be able to potentially do 5,000 IOPS, but that is
> > > without the OSD or any other layers doing any work.
> > >
> > > > I already read that the read_ahead setting might improve the
> > > > situation, although this would only be true when using buffered
> > > > reads, right?
> > > >
> > > > Does anyone have other suggestions to get better serial read
> > > > performance?
> > >
> > > You might want to disable all logging and look at AsyncMessenger.
> > > Disabling cephx might help, but that is not very safe to do.
> >
> > Just to add to what Wido has mentioned: the problem is latency
> > serialisation. The network and the Ceph code mean that each IO
> > request has to travel further than it would over a local SATA cable.
> >
> > The trick is to try and remove as much of this as possible where you can.
> > Wido has mentioned one good option, turning off logging. One thing I
> > have found which helps massively is to force the CPU c-state to 1 and
> > pin the CPUs at their max frequency. Otherwise the CPUs can spend up
> > to 200us waking up from deep sleep several times every IO. Doing this
> > I managed to get my 4KB write latency for a 3x replica pool down to 600us!!
> >
> > So stick this on your kernel boot line
> >
> > intel_idle.max_cstate=1
> >
> > and stick this somewhere like your rc.local
> >
> > echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
> >
> > Although there may be some gains from setting it to 90-95%, so that when
> > only 1 core is active it can turbo slightly higher.
> >
> > Also, since you are using the RBD engine in fio, you should be able to
> > use readahead caching with direct IO. You just need to enable it in
> > your ceph.conf on the client machine where you are running fio.
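To put numbers on the latency argument made in this thread: at an iodepth of 1 each request has to complete before the next one is issued, so IOPS is simply one second divided by the per-request latency. The small Python sketch below runs that arithmetic on the figures quoted in the thread; the helper names are made up for illustration.

```python
US_PER_SECOND = 1_000_000


def iops_at_depth_one(latency_us):
    """Upper bound on IOPS when requests are issued strictly one at a time."""
    return US_PER_SECOND / latency_us


def latency_us_from_iops(iops):
    """Average per-request latency implied by an IOPS figure at iodepth=1."""
    return US_PER_SECOND / iops


# 0.200 ms of network latency alone caps a single thread at 5000 IOPS.
network_only_iops = iops_at_depth_one(200)

# Per-request latencies implied by the benchmark results in the thread.
rbd_before_us = latency_us_from_iops(1382)    # before tuning
rbd_after_us = latency_us_from_iops(1953)     # logging off, max_cstate=1
local_nvme_us = latency_us_from_iops(84176)   # one NVMe device, psync engine

print(f"network-only ceiling: {network_only_iops:.0f} IOPS")
print(f"RBD before tuning:    {rbd_before_us:.0f} us per IO")
print(f"RBD after tuning:     {rbd_after_us:.0f} us per IO")
print(f"local NVMe:           {local_nvme_us:.0f} us per IO")
```

This works out to roughly 724us per IO before tuning, 512us after, and about 12us for the local device. Nearly all of the time per request is therefore spent outside the NVMe device itself, which is why removing c-state wake-ups and logging overhead has such a visible effect.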
> >
> > Nick
> >
> > > Wido
> > >
> > > > Cheers
> > > > Nick
> > > >
> > > > --
> > > > Sebastian Nickel
> > > > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> > > > Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users@xxxxxxxxxxxxxx
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com