Re: Ceph all NVME Cluster sequential read speed

Thanks for the explanation. I thought that when using a striped image, 4MB of 
written data would be placed in 4 objects (with a 4MB object size, a 1MB 
stripe unit and a stripe count of 4). With that, a single 4MB read would hit 
4 objects, which might be in different PGs, so the read speed should 
increase. Maybe I got that part wrong :-)
I might get the same speed improvement by using an object size of 1MB 
directly on the image.
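
For anyone who wants to try this: an image with those striping parameters can 
be created with something like the following (pool/image names are just 
placeholders, and the exact option syntax may differ between Ceph releases):
"""
# 4MB objects (order 22), 1MB stripe unit, stripe count 4
rbd create testpool/stripe-test --size 10240 --image-format 2 \
    --order 22 --stripe-unit 1048576 --stripe-count 4
"""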

Cheers
Nick

On Thursday, August 18, 2016 01:37:46 PM Nick Fisk wrote:
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of nick
> > Sent: 18 August 2016 12:39
> > To: nick@xxxxxxxxxx
> > Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re:  Ceph all NVME Cluster sequential read speed
> > 
> > So after disabling logging and setting intel_idle.max_cstate=1 we reach
> > 1953 IOPS for 4k blocksizes (with an iodepth of 1) instead of 1382. This
> > is an increase of 41%. Very cool.
> > 
> > Furthermore I played a bit with striping in RBD images. When choosing a
> > 1MB stripe unit and a stripe count of 4 there is a huge difference when
> > benchmarking with bigger block sizes (with a 4MB blocksize I get twice the
> > speed). Benchmarking with 4k blocksizes I see almost no difference to the
> > default images (stripe-unit=4M and stripe-count=1).
> > 
> > Did anyone else play with different stripe units in their images? I guess
> > the best stripe unit depends on the expected workload pattern in the
> > virtual machine.
> 
> The RBD image is already striped in object-sized chunks; the difference from
> RAID stripes is the size of the chunks/objects involved. A RAID array might
> stripe in 64kb chunks, which means that even a small readahead will likely
> cause a read across all chunks of the stripe, giving very good performance.
> In Ceph the chunks are 4MB, so if you want to read across multiple objects
> you will need a readahead larger than 4MB.
> 
> The image-level striping is more about lowering contention on a single PG
> than about improving sequential performance. I.e. you might have a couple of
> MB worth of data that is being hit by thousands of IO requests. By using
> striping you can try to spread these requests over more PGs. There is a
> point in the data path of a PG that is effectively single threaded.
> 
> If you want to improve sequential reads, you want to use buffered IO with a
> large readahead (>16M).
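> 
> For example, if the image is mapped via krbd or used as a block device
> inside a VM, something like this sets a 16MB readahead (the device name is
> just an example):
> """
> # read_ahead_kb is in KiB; 16384 KiB = 16 MiB
> echo 16384 > /sys/block/rbd0/queue/read_ahead_kb
> # or the same via blockdev (the value is in 512-byte sectors)
> blockdev --setra 32768 /dev/rbd0
> """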
> > Cheers
> > Nick
> > 
> > On Thursday, August 18, 2016 10:23:34 AM Nick Fisk wrote:
> > > > -----Original Message-----
> > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> > > > Of wido@xxxxxxxx
> > > > Sent: 18 August 2016 09:35
> > > > To: nick <nick@xxxxxxx>
> > > > Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > > > Subject: Re:  Ceph all NVME Cluster sequential read
> > > > speed
> > > > 
> > > > > On 18 Aug 2016, at 10:15, nick <nick@xxxxxxx> wrote:
> > > > > 
> > > > > Hi,
> > > > > we are currently building a new ceph cluster with only NVME devices.
> > > > > One node consists of 4x Intel P3600 2TB devices. Journal and
> > > > > filestore are on the same device. Each server has a 10-core CPU
> > > > > and uses 10 GBit Ethernet NICs for public and Ceph storage
> > > > > traffic. We are currently testing with 4 nodes overall.
> > > > > 
> > > > > The cluster will be used only for virtual machine images via RBD.
> > > > > The pools are replicated (no EC).
> > > > > 
> > > > > Although we are pretty happy with the single-threaded write
> > > > > performance, the single-threaded (iodepth=1) sequential read
> > > > > performance is a bit disappointing.
> > > > > 
> > > > > We are testing with fio and the rbd engine. After creating a 10GB
> > > > > RBD image, we use the following fio params to test:
> > > > > """
> > > > > [global]
> > > > > invalidate=1
> > > > > ioengine=rbd
> > > > > iodepth=1
> > > > > ramp_time=2
> > > > > size=2G
> > > > > bs=4k
> > > > > direct=1
> > > > > buffered=0
> > > > > """
> > > > > 
> > > > > For a 4k workload we are reaching 1382 IOPS. Testing one NVME
> > > > > device directly (with the psync engine and an iodepth of 1) we can
> > > > > reach up to 84176 IOPS. This is a big difference.
> > > > 
> > > > The network makes a big difference here. Keep in mind the Ceph OSDs
> > > > have to process the I/O as well.
> > > > 
> > > > For example, if you have a network latency of 0.200ms, then in 1000ms
> > > > (1 sec) you can potentially do at most 5,000 IOPS, and that is without
> > > > the OSD or any other layers doing any work.
> > > > 
> > > > > I already read that the read_ahead setting might improve the
> > > > > situation, although this would only be true when using buffered
> > > > > reads, right?
> > > > > 
> > > > > Does anyone have other suggestions to get better serial read
> > > > > performance?
> > > > 
> > > > You might want to disable all logging and look at AsyncMessenger.
> > > > Disabling cephx might help, but that is not very safe to do.
> > > 
> > > Just to add to what Wido has mentioned: the problem is latency
> > > serialisation. The effect of the network and the Ceph code means that
> > > each IO request has to travel further than it would over a local SATA
> > > cable.
> > > 
> > > The trick is to try and remove as much of this as possible where you
> > > can. Wido has mentioned one good option: turning off logging. One thing
> > > I have found which helps massively is to force the CPU c-state to 1 and
> > > pin the CPUs at their max frequency. Otherwise the CPUs can spend up to
> > > 200us waking up from deep sleep several times per IO. Doing this I
> > > managed to get my 4kb write latency for a 3x replica pool down to
> > > 600us!!
> > > 
> > > So stick this on your kernel boot line
> > > 
> > > intel_idle.max_cstate=1
> > > 
> > > and stick this somewhere like your rc.local
> > > 
> > > echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
> > > 
> > > Although there may be some gains in setting it to 90-95%, so that when
> > > only 1 core is active it can turbo slightly higher.
> > > 
> > > Also, since you are using the RBD engine in fio, you should be able to
> > > use readahead caching with direct IO. You just need to enable it in
> > > ceph.conf on the client machine where you are running fio.
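> > > 
> > > For example, something like this in the [client] section should do it
> > > (the values are just a starting point; defaults vary by release):
> > > """
> > > [client]
> > > rbd cache = true
> > > # start readahead after 10 sequential requests (the default trigger)
> > > rbd readahead trigger requests = 10
> > > # read ahead up to 4MB at a time
> > > rbd readahead max bytes = 4194304
> > > # 0 = never switch readahead off, however much data has been read
> > > rbd readahead disable after bytes = 0
> > > """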
> > > 
> > > Nick
> > > 
> > > > Wido
> > > > 
> > > > > Cheers
> > > > > Nick
> > > > > 
> > 
 
-- 
Sebastian Nickel
Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
