So after disabling logging and setting intel_idle.max_cstate=1 we reach 1953 IOPS
for 4k block sizes (with an iodepth of 1) instead of 1382. This is an increase of
41%. Very cool.

Furthermore I played a bit with striping in RBD images. When choosing a 1MB stripe
unit and a stripe count of 4 there is a huge difference when benchmarking with
bigger block sizes (with a 4MB block size I get twice the speed). Benchmarking with
4k block sizes I see almost no difference compared to the default images
(stripe-unit=4M and stripe-count=1). Did anyone else play with different stripe
units for their images? I guess the right stripe unit depends on the expected I/O
pattern inside the virtual machine.
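In case anyone wants to reproduce the comparison: the striping parameters can only
be set when an image is created, so something like the following should give the
two layouts I tested. This is only a sketch: the pool and image names are
placeholders, the stripe unit is given in bytes, and older releases may
additionally need --image-format 2.

"""
# default layout: 4M objects, stripe unit = object size, stripe count = 1
# (pool "rbdbench" and the image names are placeholders)
rbd create rbdbench/img-default --size 10G

# 1M stripe unit spread round-robin across 4 objects (1048576 bytes = 1M)
rbd create rbdbench/img-striped --size 10G --stripe-unit 1048576 --stripe-count 4
"""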
Cheers
Nick

On Thursday, August 18, 2016 10:23:34 AM Nick Fisk wrote:
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> > wido@xxxxxxxx
> > Sent: 18 August 2016 09:35
> > To: nick <nick@xxxxxxx>
> > Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re: Ceph all NVME Cluster sequential read speed
> >
> > > Op 18 aug. 2016 om 10:15 heeft nick <nick@xxxxxxx> het volgende
> > > geschreven:
> > >
> > > Hi,
> > > we are currently building a new Ceph cluster with only NVMe devices.
> > > One node consists of 4x Intel P3600 2TB devices. Journal and filestore
> > > are on the same device. Each server has a 10 core CPU and uses 10 GBit
> > > ethernet NICs for public and Ceph storage traffic. We are currently
> > > testing with 4 nodes overall.
> > >
> > > The cluster will be used only for virtual machine images via RBD. The
> > > pools are replicated (no EC).
> > >
> > > Although we are pretty happy with the single threaded write
> > > performance, the single threaded (iodepth=1) sequential read
> > > performance is a bit disappointing.
> > >
> > > We are testing with fio and the rbd engine. After creating a 10GB RBD
> > > image, we use the following fio params to test:
> > > """
> > > [global]
> > > invalidate=1
> > > ioengine=rbd
> > > iodepth=1
> > > ramp_time=2
> > > size=2G
> > > bs=4k
> > > direct=1
> > > buffered=0
> > > """
> > >
> > > For a 4k workload we are reaching 1382 IOPS. Testing one NVMe device
> > > directly (with the psync engine and an iodepth of 1) we can reach up
> > > to 84176 IOPS. This is a big difference.
> >
> > Network is a big difference as well. Keep in mind the Ceph OSDs have to
> > process the I/O as well.
> >
> > For example, if you have a network latency of 0.200ms, in 1,000ms (1 sec)
> > you will be able to do at most 5,000 IOPS, and that is without the OSD or
> > any other layers doing any work.
> >
> > > I already read that the read_ahead setting might improve the
> > > situation, although this would only be true when using buffered reads,
> > > right?
> > >
> > > Does anyone have other suggestions to get better serial read
> > > performance?
> >
> > You might want to disable all logging and look at AsyncMessenger.
> > Disabling cephx might help, but that is not very safe to do.
>
> Just to add to what Wido has mentioned: the problem is serialised latency.
> The effect of the network and the Ceph code means that each I/O request
> has to travel further than it would over a local SATA cable.
>
> The trick is to try and remove as much of this as possible where you can.
> Wido has mentioned one good option, turning off logging. One thing I have
> found which helps massively is to force the CPU C-state to 1 and pin the
> CPUs at their max frequency. Otherwise the CPUs can spend up to 200us
> waking up from deep sleep several times per I/O. Doing this I managed to
> get my 4kb write latency for a 3x replica pool down to 600us!!
>
> So stick this on your kernel boot line
>
> intel_idle.max_cstate=1
>
> and stick this somewhere like your rc.local
>
> echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
>
> Although there may be some gains in setting it to 90-95%, so that when
> only one core is active it can turbo slightly higher.
>
> Also, since you are using the RBD engine in fio you should be able to use
> readahead caching with direct I/O. You just need to enable it in your
> ceph.conf on the client machine where you are running fio.
>
> Nick
>
> > Wido
> >
> > > Cheers
> > > Nick
> > >
> > > --
> > > Sebastian Nickel
> > > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> > > Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
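PS: since both disabling logging and the client-side readahead caching came up,
here is a rough sketch of what the relevant ceph.conf snippet on the client
machine running fio can look like. The debug_* list is only a representative
subset and the readahead values are illustrative, so double check the option
names and defaults for your release:

"""
[client]
# librbd readahead prefetches into the RBD cache, so keep the cache enabled
rbd cache = true
# 0 = do not switch readahead off after the first 50MB have been read
rbd readahead disable after bytes = 0
# how far to read ahead (illustrative value, the default is 512KB)
rbd readahead max bytes = 4194304

[global]
# subset of the debug settings; there are many more debug_* options
debug ms = 0/0
debug auth = 0/0
debug monc = 0/0
debug rados = 0/0
debug rbd = 0/0
debug objectcacher = 0/0
"""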
--
Sebastian Nickel
Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com