So after disabling logging and setting intel_idle.max_cstate=1 we reach 1953 IOPS
for 4k block sizes (with an iodepth of 1) instead of 1382. This is an increase of
41%. Very cool.

Furthermore I played a bit with striping in RBD images. When choosing a 1MB stripe
unit and a stripe count of 4 there is a huge difference when benchmarking with
bigger block sizes (with a 4MB block size I get twice the speed). Benchmarking with
4k block sizes I see almost no difference compared to the default images
(stripe-unit=4M and stripe-count=1). Did anyone else play with different stripe
units for their images? I guess the right stripe unit depends on the expected I/O
pattern inside the virtual machine.
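In case anyone wants to reproduce the comparison: the striping parameters can only
be set when an image is created, so something like the following should give the
two layouts I tested. This is only a sketch: the pool and image names are
placeholders, the stripe unit is given in bytes, and older releases may
additionally need --image-format 2.

"""
# default layout: 4M objects, stripe unit = object size, stripe count = 1
# (pool "rbdbench" and the image names are placeholders)
rbd create rbdbench/img-default --size 10G

# 1M stripe unit spread round-robin across 4 objects (1048576 bytes = 1M)
rbd create rbdbench/img-striped --size 10G --stripe-unit 1048576 --stripe-count 4
"""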
Cheers
Nick

On Thursday, August 18, 2016 10:23:34 AM Nick Fisk wrote:
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> > wido@xxxxxxxx
> > Sent: 18 August 2016 09:35
> > To: nick <nick@xxxxxxx>
> > Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re: Ceph all NVME Cluster sequential read speed
> >
> > > Op 18 aug. 2016 om 10:15 heeft nick <nick@xxxxxxx> het volgende
> > > geschreven:
> > >
> > > Hi,
> > > we are currently building a new Ceph cluster with only NVMe devices.
> > > One node consists of 4x Intel P3600 2TB devices. Journal and filestore
> > > are on the same device. Each server has a 10 core CPU and uses 10 GBit
> > > ethernet NICs for public and Ceph storage traffic. We are currently
> > > testing with 4 nodes overall.
> > >
> > > The cluster will be used only for virtual machine images via RBD. The
> > > pools are replicated (no EC).
> > >
> > > Although we are pretty happy with the single threaded write
> > > performance, the single threaded (iodepth=1) sequential read
> > > performance is a bit disappointing.
> > >
> > > We are testing with fio and the rbd engine. After creating a 10GB RBD
> > > image, we use the following fio params to test:
> > > """
> > > [global]
> > > invalidate=1
> > > ioengine=rbd
> > > iodepth=1
> > > ramp_time=2
> > > size=2G
> > > bs=4k
> > > direct=1
> > > buffered=0
> > > """
> > >
> > > For a 4k workload we are reaching 1382 IOPS. Testing one NVMe device
> > > directly (with the psync engine and an iodepth of 1) we can reach up
> > > to 84176 IOPS. This is a big difference.
> >
> > Network is a big difference as well. Keep in mind the Ceph OSDs have to
> > process the I/O as well.
> >
> > For example, if you have a network latency of 0.200ms, in 1,000ms (1 sec)
> > you will be able to do at most 5,000 IOPS, and that is without the OSD or
> > any other layers doing any work.
> >
> > > I already read that the read_ahead setting might improve the
> > > situation, although this would only be true when using buffered reads,
> > > right?
> > >
> > > Does anyone have other suggestions to get better serial read
> > > performance?
> >
> > You might want to disable all logging and look at AsyncMessenger.
> > Disabling cephx might help, but that is not very safe to do.
>
> Just to add to what Wido has mentioned: the problem is serialised latency.
> The effect of the network and the Ceph code means that each I/O request
> has to travel further than it would over a local SATA cable.
>
> The trick is to try and remove as much of this as possible where you can.
> Wido has mentioned one good option, turning off logging. One thing I have
> found which helps massively is to force the CPU C-state to 1 and pin the
> CPUs at their max frequency. Otherwise the CPUs can spend up to 200us
> waking up from deep sleep several times per I/O. Doing this I managed to
> get my 4kb write latency for a 3x replica pool down to 600us!!
>
> So stick this on your kernel boot line
>
> intel_idle.max_cstate=1
>
> and stick this somewhere like your rc.local
>
> echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
>
> Although there may be some gains in setting it to 90-95%, so that when
> only one core is active it can turbo slightly higher.
>
> Also, since you are using the RBD engine in fio you should be able to use
> readahead caching with direct I/O. You just need to enable it in your
> ceph.conf on the client machine where you are running fio.
>
> Nick
>
> > Wido
> >
> > > Cheers
> > > Nick
> > >
> > > --
> > > Sebastian Nickel
> > > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> > > Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
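PS: since both disabling logging and the client-side readahead caching came up,
here is a rough sketch of what the relevant ceph.conf snippet on the client
machine running fio can look like. The debug_* list is only a representative
subset and the readahead values are illustrative, so double check the option
names and defaults for your release:

"""
[client]
# librbd readahead prefetches into the RBD cache, so keep the cache enabled
rbd cache = true
# 0 = do not switch readahead off after the first 50MB have been read
rbd readahead disable after bytes = 0
# how far to read ahead (illustrative value, the default is 512KB)
rbd readahead max bytes = 4194304

[global]
# subset of the debug settings; there are many more debug_* options
debug ms = 0/0
debug auth = 0/0
debug monc = 0/0
debug rados = 0/0
debug rbd = 0/0
debug objectcacher = 0/0
"""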
--
Sebastian Nickel
Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com