Re: Ceph all NVME Cluster sequential read speed

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of nick
> Sent: 18 August 2016 12:39
> To: nick@xxxxxxxxxx
> Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  Ceph all NVME Cluster sequential read speed
> 
> So after disabling logging and setting intel_idle.max_cstate=1 we reach 1953 IOPS for 4k blocksizes (with an iodepth of 1)
> instead of 1382. This is an increase of 41%. Very cool.
> 
> Furthermore I played a bit with striping in RBD images. When choosing a 1MB stripe unit and a stripe count of 4 there is a huge
> difference when benchmarking with bigger block sizes (with 4MB blocksize I get twice the speed). Benchmarking this with 4k
> blocksizes I can see almost no difference to the default images (stripe-unit=4M and stripe-count=1).
> 
> Did anyone else play with different stripe units in the images? I guess the stripe unit depends on the expected work pattern
> in the virtual machine.

An RBD image is already striped across object-sized chunks; the difference from RAID stripes is the size of the chunks/objects
involved. A RAID array might stripe in 64kB chunks, which means that even a small readahead will likely cause a read across all
chunks of the stripe, giving very good performance. In Ceph the chunks are 4MB by default, so if you want a single read to span
multiple objects you need a readahead bigger than 4MB.
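
As a quick sanity check (the pool/image name here is just a placeholder), rbd info will show you what object size an image was created with:

rbd info rbd/vm-disk-1

Look for a line like "order 22 (4096 kB objects)", which is the default 4MB object size; an image created with fancy striping should also report its stripe unit and stripe count there.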

The image-level striping is more about lowering contention on a single PG than about improving sequential performance. I.e. you
might have a couple of MB worth of data that is being hit by thousands of IO requests; by using striping you can spread those
requests over more PGs. There is a point in the data path of a PG that is effectively single threaded.
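
For reference, this sort of striping has to be chosen at image creation time. Something along these lines should do it (the pool/image name is a placeholder, and on some releases --stripe-unit wants plain bytes rather than a suffix):

rbd create rbd/vm-disk-striped --size 10240 --stripe-unit 1048576 --stripe-count 4

That matches the 1MB stripe unit / stripe count 4 layout you tested. As far as I know only librbd (not krbd) can map fancy-striped images, which is fine for the fio rbd engine and for qemu.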

If you want to improve sequential reads, use buffered IO with a large readahead (>16M).
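
Roughly what that looks like in practice (the numbers below are illustrative rather than tuned): for a kernel-mapped RBD device you raise the block device readahead, which is set in 512-byte sectors, while librbd clients (qemu, or the fio rbd engine) take their readahead settings from the [client] section of ceph.conf:

# krbd: 32768 sectors x 512 bytes = 16MB readahead
blockdev --setra 32768 /dev/rbd0

# librbd: client-side readahead, needs the cache enabled
[client]
    rbd cache = true
    # default max is 512kB, allow it to grow to 16MB
    rbd readahead max bytes = 16777216
    # 0 = don't switch readahead off after the first 50MB read
    rbd readahead disable after bytes = 0

Bear in mind the librbd readahead only kicks in after it sees a run of sequential requests, so it helps streaming reads rather than the 4k iodepth=1 case.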

> 
> Cheers
> Nick
> 
> On Thursday, August 18, 2016 10:23:34 AM Nick Fisk wrote:
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of wido@xxxxxxxx
> > > Sent: 18 August 2016 09:35
> > > To: nick <nick@xxxxxxx>
> > > Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > > Subject: Re:  Ceph all NVME Cluster sequential read speed
> > >
> > > > Op 18 aug. 2016 om 10:15 heeft nick <nick@xxxxxxx> het volgende
> > > > geschreven:
> > > >
> > > > Hi,
> > > > we are currently building a new ceph cluster with only NVME devices.
> > > > One Node consists of 4x Intel P3600 2TB devices. Journal and
> > > > filestore are on the same device. Each server has a 10 core CPU
> > > > and uses 10 GBit ethernet NICs for public and ceph storage
> > > > traffic. We are currently testing with 4 nodes overall.
> > > >
> > > > The cluster will be used only for virtual machine images via RBD.
> > > > The pools are replicated (no EC).
> > > >
> > > > Altough we are pretty happy with the single threaded write
> > > > performance, the single threaded (iodepth=1) sequential read
> > > > performance is a bit disappointing.
> > > >
> > > > We are testing with fio and the rbd engine. After creating a 10GB
> > > > RBD image, we use the following fio params to test:
> > > > """
> > > > [global]
> > > > invalidate=1
> > > > ioengine=rbd
> > > > iodepth=1
> > > > ramp_time=2
> > > > size=2G
> > > > bs=4k
> > > > direct=1
> > > > buffered=0
> > > > """
> > > >
> > > > For a 4k workload we are reaching 1382 IOPS. Testing one NVME
> > > > device directly (with psync engine and iodepth of 1) we can reach
> > > > up to 84176 IOPS. This is a big difference.
> > >
> > > Network is a big difference as well. Keep in mind the Ceph OSDs have
> > > to process the I/O as well.
> > >
> > > For example, if you have a network latency of 0.200ms, in 1,000ms (1 sec) you will be able to potentially do 5,000 IOPS, but
> > > that is without the OSD or any other layers doing any work.
> > >
> > > > I already read that the read_ahead setting might improve the
> > > > situation, although this would only be true when using buffered
> > > > reads, right?
> > > >
> > > > Does anyone have other suggestions to get better serial read
> > > > performance?
> > >
> > > You might want to disable all logging and look at AsyncMessenger.
> > > Disabling cephx might help, but that is not very safe to do.
> > Just to add to what Wido has mentioned. The problem is latency serialisation: the effect of the network and the Ceph code means
> > that each IO request has to travel further than it would over a local SATA cable.
> >
> > The trick is to try and remove as much of this as possible where you can.
> > Wido has mentioned one good option of turning off logging. One thing I
> > have found which helps massively is to force the CPU C-state to 1 and
> > pin the CPUs at their max frequency. Otherwise the CPUs can spend up
> > to 200us waking up from deep sleep several times per IO. Doing this
> > I managed to get my 4kB write latency for a 3x replica pool down to 600us!!
> >
> > So stick this on your kernel boot line
> >
> > intel_idle.max_cstate=1
> >
> > and stick this somewhere like your rc.local
> >
> > echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
> >
> > Although there may be some gains from setting it to 90-95%, so that when
> > only 1 core is active it can turbo slightly higher.
> >
> > Also, since you are using the RBD engine in fio, you should be able to
> > use readahead caching even with direct IO. You just need to enable it in
> > your ceph.conf on the client machine where you are running fio.
> >
> > Nick
> >
> > > Wido
> > >
> > > > Cheers
> > > > Nick
> > > >
> > > > --
> > > > Sebastian Nickel
> > > > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> > > > Tel
> > > > +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users@xxxxxxxxxxxxxx
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> --
> Sebastian Nickel
> Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


