Thanks for the explanation. I thought that with a striped image, 4MB of written data would be placed in 4 objects (given a 4MB object size, a 1MB stripe unit and a stripe count of 4). A single 4MB read would then hit 4 objects, which might live in different PGs, so the read speed should increase. Maybe I got that part wrong :-) Perhaps I would see the same speed improvement by simply using an object size of 1MB directly on the image. (I have put a small rbd create sketch for that layout further down, inline in the quoted text, for reference.)

Cheers
Nick

On Thursday, August 18, 2016 01:37:46 PM Nick Fisk wrote:
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of nick
> > Sent: 18 August 2016 12:39
> > To: nick@xxxxxxxxxx
> > Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re: Ceph all NVME Cluster sequential read speed
> >
> > So after disabling logging and setting intel_idle.max_cstate=1 we reach 1953 IOPS for 4k blocksizes (with an iodepth of 1) instead of 1382. This is an increase of 41%. Very cool.
> >
> > Furthermore I played a bit with striping in RBD images. When choosing a 1MB stripe unit and a stripe count of 4 there is a huge difference when benchmarking with bigger block sizes (with a 4MB blocksize I get twice the speed). Benchmarking with 4k blocksizes I can see almost no difference to the default images (stripe-unit=4M and stripe-count=1).
> >
> > Did anyone else play with different stripe units in their images? I guess the right stripe unit depends on the expected workload pattern in the virtual machine.
>
> The RBD is already striped in object sized chunks; the difference to RAID stripes is the size of the chunks/objects involved. A RAID array might chunk into 64kb chunks, which means that even a small readahead will likely cause a read across all chunks of the stripe, giving very good performance. In Ceph the chunks are 4MB, which means that if you want to read across multiple objects, you will need a readahead larger than 4MB.
>
> The image level striping is more to do with lowering contention on a single PG than with improving sequential performance. I.e. you might have a couple of MB worth of data that is being hit by thousands of IO requests. By using striping you can try and spread these requests over more PGs. There is a point in the data path of a PG that is effectively single threaded.
>
> If you want to improve sequential reads you want to use buffered IO and a large readahead (>16M).
>
> Cheers
>
> Nick
>
> > On Thursday, August 18, 2016 10:23:34 AM Nick Fisk wrote:
> > > > -----Original Message-----
> > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of wido@xxxxxxxx
> > > > Sent: 18 August 2016 09:35
> > > > To: nick <nick@xxxxxxx>
> > > > Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > > > Subject: Re: Ceph all NVME Cluster sequential read speed
> > > >
> > > > > On 18 Aug 2016, at 10:15, nick <nick@xxxxxxx> wrote the following:
> > > > >
> > > > > Hi,
> > > > > we are currently building a new ceph cluster with only NVME devices. One node consists of 4x Intel P3600 2TB devices. Journal and filestore are on the same device. Each server has a 10 core CPU and uses 10 GBit ethernet NICs for public and ceph storage traffic. We are currently testing with 4 nodes overall.
> > > > >
> > > > > The cluster will be used only for virtual machine images via RBD. The pools are replicated (no EC).
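(Side note, mostly for the archives: a striped layout like the one I describe at the top of this mail can be created with rbd's striping options. The pool and image names below are just placeholders and I am writing the flags down from memory, so please double-check them against the rbd man page before copying anything.)

    # Placeholder pool/image names; the numbers match the layout discussed above:
    # 4 MB objects (order 22), 1 MB stripe unit, stripe count 4.
    # --size is given in MB here (10240 MB = 10 GB test image).
    rbd create rbd/striped-test --size 10240 --image-format 2 \
        --order 22 --stripe-unit 1048576 --stripe-count 4
    # Verify the resulting layout (shows stripe unit and stripe count):
    rbd info rbd/striped-test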
> > > > >
> > > > > Although we are pretty happy with the single threaded write performance, the single threaded (iodepth=1) sequential read performance is a bit disappointing.
> > > > >
> > > > > We are testing with fio and the rbd engine. After creating a 10GB RBD image, we use the following fio params to test:
> > > > > """
> > > > > [global]
> > > > > invalidate=1
> > > > > ioengine=rbd
> > > > > iodepth=1
> > > > > ramp_time=2
> > > > > size=2G
> > > > > bs=4k
> > > > > direct=1
> > > > > buffered=0
> > > > > """
> > > > >
> > > > > For a 4k workload we are reaching 1382 IOPS. Testing one NVME device directly (with the psync engine and an iodepth of 1) we can reach up to 84176 IOPS. This is a big difference.
> > > >
> > > > Network is a big difference as well. Keep in mind the Ceph OSDs have to process the I/O as well.
> > > >
> > > > For example, if you have a network latency of 0.200ms, in 1.000ms (1 sec) you will be able to potentially do 5.000 IOps, but that is without the OSD or any other layers doing any work.
> > > >
> > > > > I already read that the read_ahead setting might improve the situation, although this would only be true when using buffered reads, right?
> > > > >
> > > > > Does anyone have other suggestions to get better serial read performance?
> > > >
> > > > You might want to disable all logging and look at AsyncMessenger. Disabling cephx might help, but that is not very safe to do.
> > >
> > > Just to add to what Wido has mentioned: the problem is latency serialisation. The effect of the network and the Ceph code means that each IO request has to travel much further than it would over a local SATA cable.
> > >
> > > The trick is to try and remove as much of this as possible where you can. Wido has mentioned one good option, turning off logging. One thing I have found which helps massively is to force the CPU c-state to 1 and pin the CPUs at their max frequency. Otherwise the CPUs can spend up to 200us waking up from deep sleep several times per IO. Doing this I managed to get my 4kb write latency for a 3x replica pool down to 600us!!
> > >
> > > So stick this on your kernel boot line
> > >
> > > intel_idle.max_cstate=1
> > >
> > > and stick this somewhere like your rc.local
> > >
> > > echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
> > >
> > > Although there may be some gains in setting it to 90-95%, so that when only one core is active it can turbo slightly higher.
> > >
> > > Also, since you are using the RBD engine in fio you should be able to use readahead caching with directio. You just need to enable it in your ceph.conf on the client machine where you are running fio.
> > >
> > > Nick
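I have not tried this myself yet, but if I understand the librbd readahead options correctly, enabling readahead caching for the fio client would look roughly like the fragment below in the client's ceph.conf. The values are only examples I picked for illustration, not tested recommendations:

    [client]
        rbd cache = true
        # librbd readahead: kicks in after a number of sequential requests,
        # each readahead request being at most the configured size
        rbd readahead trigger requests = 10
        rbd readahead max bytes = 4194304
        # 0 = keep readahead enabled (by default it is switched off after
        # roughly the first 50MB read from the image)
        rbd readahead disable after bytes = 0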
> > > > Wido
>
> > > > > Cheers
> > > > > Nick
> > > > >
> > > > > --
> > > > > Sebastian Nickel
> > > > > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> > > > > Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> >
> > --
> > Sebastian Nickel
> > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> > Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

--
Sebastian Nickel
Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch