Re: Increasing QD=1 performance (lowering latency)

Wido den Hollander <wido@xxxxxxxx> · Thu, 11 Feb 2021 11:58:13 +0100

On 09/02/2021 20:47, Maged Mokhtar wrote:

On 07/02/2021 15:43, Mark Nelson wrote:
Hi Wido,

Long time no see!  Yeah, this kind of use case is painful for ceph.  
It really hurts having to send off synchronous replica writes to other 
OSDs and having to wait to ack the client after *all* have completed.  
We're only as fast as the slowest replica write for each IO.  We can't 
parallelize thigns at all since it's literally a single client doing a 
single IO at a time.  I don't doubt your ZFS results, it sounds about 
right given the difference in the IO path.  FWIW, I've been doing 
quite a bit of work lately poking at crimson with the "alienized" 
bluestore implementation using fio+librbd against a single NVMe backed 
OSD running on localhost.  We're not as fast as the classic OSD for 
large parallel workloads because we only have a single reactor thread 
right now.  We'll need multiple reactors before we can match classic 
at high  queue depths.  At lower queue depths crimson is actually 
faster though, despite not even using a seastar native objectstore yet.

For fun I just did a quick 30s QD1 test on our newer test hardware 
that Intel donated for the community lab. (Xeon Platinum CPUs, P4510 
NVMe drives, etc).  This is "almost master" with a couple of 
additional crimson PRs using a single 16GB pre-allocated RBD volume 
and all of the system level optimizations you can imagine (no c/p 
state transitions, ceph-osd pinned to a specific set of cores, fio on 
localhost, no replication, etc), so it's pretty best case in terms of 
squeezing out performance.  First, here's the classic OSD with 
bluestore on master:

[nhm@o01 initial]$ sudo /home/nhm/src/fio/fio --ioengine=rbd 
--direct=1 --bs=4096B --iodepth=1 --end_fsync=1 --rw=randwrite 
--norandommap --size=16384M --numjobs=1 --runtime=30 --time_based 
--clientname=admin --pool=cbt-librbd --rbdname=`hostname -f`-0 
--invalidate=0 --name=cbt-librbd/`hostname -f`-0-0
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R) 
4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1

Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=14.5MiB/s][w=3723 IOPS][eta 00m:00s]
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (groupid=0, jobs=1): err= 
0: pid=2085951: Sun Feb  7 12:55:19 2021
  write: IOPS=3804, BW=14.9MiB/s (15.6MB/s)(446MiB/30001msec); 0 zone 
resets
    slat (usec): min=2, max=474, avg= 6.85, stdev= 2.73
    clat (usec): min=187, max=2658, avg=255.05, stdev=81.21
     lat (usec): min=192, max=2704, avg=261.90, stdev=81.75
    clat percentiles (usec):
     |  1.00th=[  200],  5.00th=[  204], 10.00th=[  210], 20.00th=[  
219],
     | 30.00th=[  225], 40.00th=[  231], 50.00th=[  235], 60.00th=[  
239],
     | 70.00th=[  245], 80.00th=[  253], 90.00th=[  293], 95.00th=[  
502],
     | 99.00th=[  570], 99.50th=[  611], 99.90th=[  963], 99.95th=[  
988],
     | 99.99th=[ 1401]
   bw (  KiB/s): min=14704, max=16104, per=100.00%, avg=15234.44, 
stdev=260.42, samples=59
   iops        : min= 3676, max= 4026, avg=3808.61, stdev=65.10, 
samples=59
  lat (usec)   : 250=76.75%, 500=18.23%, 750=4.86%, 1000=0.12%
  lat (msec)   : 2=0.03%, 4=0.01%
  cpu          : usr=3.12%, sys=2.55%, ctx=114166, majf=0, minf=184
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     issued rwts: total=0,114148,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=14.9MiB/s (15.6MB/s), 14.9MiB/s-14.9MiB/s 
(15.6MB/s-15.6MB/s), io=446MiB (468MB), run=30001-30001msec

Disk stats (read/write):
    dm-2: ios=16/103, merge=0/0, ticks=1/24, in_queue=25, util=0.03%, 
aggrios=31/426, aggrmerge=0/115, aggrticks=4/68, aggrin_queue=76, 
aggrutil=0.30%
  sda: ios=31/426, merge=0/115, ticks=4/68, in_queue=76, util=0.30%

IMHO this is about as high as we can get right now on classic with 
everything stacked in our favor.  Latency increases quickly once you 
involve remote network clients, multiple OSDs, and replication, etc.   
Here's the same test with crimson using alienized bluestore:

[nhm@o01 initial]$ sudo /home/nhm/src/fio/fio --ioengine=rbd 
--direct=1 --bs=4096B --iodepth=1 --end_fsync=1 --rw=randwrite 
--norandommap --size=16384M --numjobs=1 --runtime=30 --time_based 
--clientname=admin --pool=cbt-librbd --rbdname=`hostname -f`-0 
--invalidate=0 --name=cbt-librbd/`hostname -f`-0-0
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R) 
4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1

Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=22.0MiB/s][w=5886 IOPS][eta 00m:00s]
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (groupid=0, jobs=1): err= 
0: pid=2075392: Sun Feb  7 12:44:32 2021
  write: IOPS=5647, BW=22.1MiB/s (23.1MB/s)(662MiB/30001msec); 0 zone 
resets
    slat (usec): min=2, max=527, avg= 2.93, stdev= 1.34
    clat (usec): min=138, max=41367, avg=173.81, stdev=232.51
     lat (usec): min=141, max=41370, avg=176.74, stdev=232.52
    clat percentiles (usec):
     |  1.00th=[  153],  5.00th=[  155], 10.00th=[  155], 20.00th=[  
157],
     | 30.00th=[  159], 40.00th=[  161], 50.00th=[  163], 60.00th=[  
163],
     | 70.00th=[  165], 80.00th=[  169], 90.00th=[  176], 95.00th=[  
210],
     | 99.00th=[  644], 99.50th=[  725], 99.90th=[  832], 99.95th=[  
881],
     | 99.99th=[ 1221]
   bw (  KiB/s): min=19024, max=23792, per=100.00%, avg=22632.32, 
stdev=964.77, samples=59
   iops        : min= 4756, max= 5948, avg=5658.07, stdev=241.20, 
samples=59
  lat (usec)   : 250=97.23%, 500=1.64%, 750=0.77%, 1000=0.33%
  lat (msec)   : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=1.91%, sys=1.78%, ctx=169457, majf=1, minf=336
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
     issued rwts: total=0,169444,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=22.1MiB/s (23.1MB/s), 22.1MiB/s-22.1MiB/s 
(23.1MB/s-23.1MB/s), io=662MiB (694MB), run=30001-30001msec

Disk stats (read/write):
    dm-2: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
aggrios=1/308, aggrmerge=0/11, aggrticks=0/47, aggrin_queue=51, 
aggrutil=0.27%
  sda: ios=1/308, merge=0/11, ticks=0/47, in_queue=51, util=0.27%

About 50% faster with lower latency, and we haven't even done much to 
optimize it yet.  It's not as fast as your ZFS results below, but at 
least for a single OSD without replication I'm glad to see crimson is 
getting us close to being in the same ballpark. I will note that the 
crimson-osd process was using 23GB (!) in this test so it's still very 
alpha code (and take the test result with a grain of salt since we're 
only doing minimal QA right now).  At least we have a target to 
maintain and hopefully improve though as we continue to work toward 
stabilizing it.

Mark

On 2/5/21 7:51 AM, Wido den Hollander wrote:
(Sending it to dev list as people might know it there)

Hi,

There are many talks and presentations out there about Ceph's
performance. Ceph is great when it comes to parallel I/O, large queue
depths and many applications sending I/O towards Ceph.

One thing where Ceph isn't the fastest are 4k blocks written at Queue
Depth 1.

Some applications benefit very much from high performance/low latency
I/O at qd=1, for example Single Threaded applications which are writing
small files inside a VM running on RBD.

With some tuning you can get to a ~700us latency for a 4k write with
qd=1 (Replication, size=3)

I benchmark this using fio:

$ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..

700us latency means the result will be about ~1500 IOps (1000 / 0.7)

When comparing this to let's say a BSD machine running ZFS that's on the
low side. With ZFS+NVMe you'll be able to reach about somewhere between
7.000 and 10.000 IOps, the latency is simply much lower.

My benchmarking / test setup for this:

- Ceph Nautilus/Octopus (doesn't make a big difference)
- 3x SuperMicro 1U with:
- AMD Epyc 7302P 16-core CPU
- 128GB DDR4
- 10x Samsung PM983 3,84TB
- 10Gbit Base-T networking

Things to configure/tune:

- C-State pinning to 1
- CPU governer to performance
- Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)

Higher clock speeds (New AMD Epyc coming in March!) help to reduce the
latency and going towards 25Gbit/100Gbit might help as well.

These are however only very small increments and might help to reduce
the latency by another 15% or so.

It doesn't bring us anywhere near the 10k IOps other applications can 
do.

And I totally understand that replication over a TCP/IP network takes
time and thus increases latency.

The Crimson project [0] is aiming to lower the latency with many things
like DPDK and SPDK, but this is far from finished and production ready.

In the meantime, am I overseeing some things here? Can we reduce the
latency further of the current OSDs?

Reaching a ~500us latency would already be great!

Thanks,

Wido

[0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx

Hi Mark/Wido

Yes this is definitely an important use case, getting high performance 
without having a very large number of parallel clients/ops (as compared 
to the number of OSDs). I am happy to hear progress on Crimson, can't 
wait to see this maturing. One thing you mention on adding more threads 
to Crimson in the future to match classic at high queue depths, this 
sounds great but just in case this will lead to locks being added that 
may impact single queue depth case, even if slightly, i would recommend 
to maybe have an OSD config value to enable a single thread mode only to 
give (/in case it gives) the best performance for single queue depth.

Indeed. High queue depths are great to benchmark with and show millions 
of IOps. In many cases you also have high QDs, but in real life you also 
have applications which simply benefit from a lower latency. QD=1

I run some 1 QD tests on a test Octopus/Bluestore system using 1 OSD ram 
disk, only 1 replica, clients on localhost, gives approx 0.28 ms latency:

rbd bench --io-type write rbd/image-01 --io-threads=1 --io-size 4K 
  --io-pattern rand --rbd_cache=false

   SEC       OPS   OPS/SEC   BYTES/SEC
     1      3584   3599.33    14 MiB/s
     2      7241   3628.19    14 MiB/s
     3     10833   3616.09    14 MiB/s

fio -ioengine=rbd --name=xx --pool=rbd --rbdname=image-01 --iodepth=1 
--rw=randwrite --bs=4k --direct=1 --runtime=10 --time_based

Run status group 0 (all jobs):
   WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s 
(13.5MB/s-13.5MB/s), io=128MiB (135MB), run=10001-10001msec

I think ceph msgr contributes a large part of the latency

ceph_perf_msgr_server 127.0.0.1:9000 <http://127.0.0.1:9000>64 10
ceph_perf_msgr_client 127.0.0.1:9000 <http://127.0.0.1:9000>1 1 1000  10 
4096
103154 us (count = 1000)  -> 103 us (count = 1)

this 0.1 ms latency is quite a large overhead for a simple msgr echo 
test, specially when compared to the latency tests below:

tcp client server socket echo test with EPOLL wait (same used in msgr 
wait events) , 4k block size
https://github.com/onestraw/epoll-example 
<https://github.com/onestraw/epoll-example>
Sent 1000000 messages, avg latency 7.117855 us

tcp latency
qperf 127.0.0.1  tcp_lat
latency  =  6.04 us

With qperf I have 19us latency over a routed 25GbE network. So yes, the 
msgr overhead of 0.1ms for msgr traffic is pretty high.

Wido

ping latency/rtt
ping -c1 -q -W1 127.0.0.1
rtt min/avg/max/mdev = 0.018/0.018/0.018/0.000 ms

/Maged

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx