Re: Ceph all NVME Cluster sequential read speed

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of nick
> Sent: 18 August 2016 14:02
> To: nick@xxxxxxxxxx
> Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  Ceph all NVME Cluster sequential read speed
> 
> Thanks for the explanation. I thought that when using a striped image, 4MB of written data would be placed in 4 objects (with a
> 4MB object size, a 1MB stripe unit and a stripe count of 4). With that, a single 4MB read would hit 4 objects which might be in
> different PGs, so the read speed should increase. Maybe I got that part wrong :-) I might get the same speed improvement when
> using an object size of 1MB directly on the image.

Yes, that is correct. But you were sending 4k I/Os, so it wouldn't have changed much, apart from the data possibly not being in the
OSD pagecache because you are jumping around PGs. Another factor is latency again: with a 4MB object you do a single read to fetch
4MB, whereas with 4x1MB objects you have to issue more IOs through Ceph, each incurring a slight latency penalty, which might be why
you see slightly lower performance.
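
For reference, creating an image with that striping layout would look roughly like this (pool and image names are placeholders,
the size is in MB, the stripe unit is in bytes, and --image-format 2 is required for non-default striping):

rbd create --image-format 2 --size 10240 \
    --stripe-unit 1048576 --stripe-count 4 \
    rbd/striping-test

# check the resulting layout
rbd info rbd/striping-test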

> 
> Cheers
> Nick
> 
> On Thursday, August 18, 2016 01:37:46 PM Nick Fisk wrote:
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > > Behalf Of nick Sent: 18 August 2016 12:39
> > > To: nick@xxxxxxxxxx
> > > Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
> > > Subject: Re:  Ceph all NVME Cluster sequential read
> > > speed
> > >
> > > So after disabling logging and setting intel_idle.max_cstate=1 we
> > > reach 1953 IOPS for 4k blocksizes (with an iodepth of 1) instead of
> > > 1382. This is an increase of 41%. Very cool.
> > >
> > > Furthermore, I played a bit with striping in RBD images. When
> > > choosing a 1MB stripe unit and a stripe count of 4 there is a huge
> > > difference when benchmarking with bigger block sizes (with a 4MB
> > > block size I get twice the speed). Benchmarking with 4k block sizes
> > > I can see almost no difference compared to the default images
> > > (stripe-unit=4M and stripe-count=1).
> > >
> > > Did anyone else play with different stripe units in their images? I
> > > guess the best stripe unit depends on the expected workload pattern
> > > in the virtual machine.
> >
> > The RBD image is already striped in object-sized chunks; the difference
> > from RAID striping is the size of the chunks/objects involved. A RAID
> > array might stripe in 64kB chunks, which means that even a small
> > readahead will likely cause a read across all chunks of the stripe,
> > giving very good performance. In Ceph the chunks are 4MB, which means
> > that if you want to read across multiple objects, you will need a
> > readahead larger than 4MB.
> >
> > Image-level striping is more about lowering contention on a single
> > PG than improving sequential performance, i.e. you might have a
> > couple of MB worth of data that is being hit by thousands of IO
> > requests. By using striping you can spread those requests over more
> > PGs, since there is a point in the data path of a PG that is
> > effectively single threaded.
> >
> > If you want to improve sequential reads, you want to use buffered IO
> > with a large readahead (>16M).
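> >
> > For illustration, inside a VM that could be something along these lines
> > (the device name /dev/vda and the 16MB value are assumptions, not tested
> > recommendations):
> >
> > # 16384 KiB = 16MB readahead on the RBD-backed disk
> > echo 16384 > /sys/block/vda/queue/read_ahead_kb
> > # or equivalently (the value is in 512-byte sectors)
> > blockdev --setra 32768 /dev/vda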
> > > Cheers
> > > Nick
> > >
> > > On Thursday, August 18, 2016 10:23:34 AM Nick Fisk wrote:
> > > > > -----Original Message-----
> > > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > > > > Behalf Of wido@xxxxxxxx Sent: 18 August 2016 09:35
> > > > > To: nick <nick@xxxxxxx>
> > > > > Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > > > > Subject: Re:  Ceph all NVME Cluster sequential read
> > > > > speed
> > > > >
> > > > > > On 18 Aug 2016 at 10:15, nick <nick@xxxxxxx> wrote:
> > > > > >
> > > > > > Hi,
> > > > > > we are currently building a new Ceph cluster with only NVMe devices.
> > > > > > Each node consists of 4x Intel P3600 2TB devices; journal and
> > > > > > filestore are on the same device. Each server has a 10-core
> > > > > > CPU and uses 10GBit Ethernet NICs for public and Ceph storage
> > > > > > traffic. We are currently testing with 4 nodes overall.
> > > > > >
> > > > > > The cluster will be used only for virtual machine images via RBD.
> > > > > > The pools are replicated (no EC).
> > > > > >
> > > > > > Although we are pretty happy with the single-threaded write
> > > > > > performance, the single-threaded (iodepth=1) sequential read
> > > > > > performance is a bit disappointing.
> > > > > >
> > > > > > We are testing with fio and the rbd engine. After creating a
> > > > > > 10GB RBD image, we use the following fio params to test:
> > > > > > """
> > > > > > [global]
> > > > > > invalidate=1
> > > > > > ioengine=rbd
> > > > > > iodepth=1
> > > > > > ramp_time=2
> > > > > > size=2G
> > > > > > bs=4k
> > > > > > direct=1
> > > > > > buffered=0
> > > > > > """
> > > > > >
> > > > > > For a 4k workload we are reaching 1382 IOPS. Testing one NVMe
> > > > > > device directly (with the psync engine and an iodepth of 1) we
> > > > > > can reach up to 84176 IOPS. This is a big difference.
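> > > > > >
> > > > > > For comparison, the raw-device run would look roughly like this
> > > > > > (device path, runtime and rw=read are assumptions; it only reads,
> > > > > > but make sure the device is not otherwise in use):
> > > > > >
> > > > > > fio --name=raw-seqread --filename=/dev/nvme0n1 --ioengine=psync \
> > > > > >     --rw=read --bs=4k --iodepth=1 --direct=1 \
> > > > > >     --runtime=60 --time_based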
> > > > >
> > > > > Network is a big difference as well. Keep in mind that the Ceph
> > > > > OSDs have to process the I/O as well.
> > > > >
> > > > > For example, if you have a network latency of 0.200ms, then in
> > > > > 1,000ms (1 sec) you will potentially be able to do 5,000 IOPS, but
> > > > > that is without the OSD or any other layers doing any work.
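> > > > >
> > > > > That ceiling is just one second divided by the per-request round
> > > > > trip, e.g.:
> > > > >
> > > > > echo "1000 / 0.200" | bc -l   # ~5000 IOPS at iodepth=1, before any OSD work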
> > > > >
> > > > > > I already read that the read_ahead setting might improve the
> > > > > > situation, although this would only be true when using
> > > > > > buffered reads, right?
> > > > > >
> > > > > > Does anyone have other suggestions to get better serial read
> > > > > > performance?
> > > > >
> > > > > You might want to disable all logging and look at AsyncMessenger.
> > > > > Disabling cephx might help, but that is not very safe to do.
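> > > > >
> > > > > For example, a minimal sketch of what disabling the noisiest
> > > > > logging plus AsyncMessenger could look like in ceph.conf (the
> > > > > subsystem list and values are illustrative, not a complete or
> > > > > tuned set):
> > > > >
> > > > > [global]
> > > > >     debug ms = 0/0
> > > > >     debug osd = 0/0
> > > > >     debug filestore = 0/0
> > > > >     debug journal = 0/0
> > > > >     debug auth = 0/0
> > > > >     debug monc = 0/0
> > > > >     # AsyncMessenger; verify it is considered stable on your release
> > > > >     ms type = async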
> > > >
> > > > Just to add to what Wido has mentioned: the problem is latency
> > > > serialisation. The effect of the network and the Ceph code means
> > > > that each IO request has to travel further than it would over a
> > > > local SATA cable.
> > > >
> > > > The trick is to try and remove as much of this as possible where
> > > > you can. Wido has mentioned one good option: turning off logging.
> > > > One thing I have found which helps massively is to force the CPU
> > > > C-state to 1 and pin the CPUs at their max frequency. Otherwise
> > > > the CPUs can spend up to 200us waking up from deep sleep several
> > > > times per IO. Doing this I managed to get my 4kb write latency
> > > > for a 3x replica pool down to 600us!!
> > > >
> > > > So stick this on your kernel boot line
> > > >
> > > > intel_idle.max_cstate=1
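> > > >
> > > > For example, on a Debian/Ubuntu style GRUB setup (paths and
> > > > variable names are assumptions) that means editing
> > > > /etc/default/grub and regenerating the config:
> > > >
> > > > # append to the existing kernel options
> > > > GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=1"
> > > > # then regenerate grub.cfg and reboot
> > > > update-grub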
> > > >
> > > > and stick this somewhere like your rc.local
> > > >
> > > > echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
> > > >
> > > > Although there may be some gains from setting it to 90-95%, so
> > > > that when only one core is active it can turbo slightly higher.
> > > >
> > > > Also, since you are using the RBD engine in fio, you should be
> > > > able to use readahead caching with direct IO. You just need to
> > > > enable it in the ceph.conf on the client machine where you are
> > > > running fio.
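> > > >
> > > > For reference, a sketch of the client-side settings involved (the
> > > > values are illustrative examples, not tuned recommendations):
> > > >
> > > > [client]
> > > >     rbd cache = true
> > > >     rbd readahead trigger requests = 10
> > > >     rbd readahead max bytes = 16777216       # 16MB, spans several 4MB objects
> > > >     rbd readahead disable after bytes = 0    # keep readahead enabled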
> > > >
> > > > Nick
> > > >
> > > > > Wido
> > > > >
> > > > > > Cheers
> > > > > > Nick
> > > > > >
> > > > > > --
> > > > > > Sebastian Nickel
> > > > > > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047
> > > > > > Zuerich Tel
> > > > > > +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> > >
> > > --
> > > Sebastian Nickel
> > > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> > > Tel +41
> > > 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> 
> --
> Sebastian Nickel
> Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


