Re: Increasing QD=1 performance (lowering latency)


 




On 07/02/2021 15:43, Mark Nelson wrote:
Hi Wido,


Long time no see!  Yeah, this kind of use case is painful for Ceph.  It really hurts having to send off synchronous replica writes to other OSDs and then wait until *all* of them have completed before acking the client.  We're only as fast as the slowest replica write for each IO.  We can't parallelize things at all since it's literally a single client doing a single IO at a time.  I don't doubt your ZFS results; they sound about right given the difference in the IO path.  FWIW, I've been doing quite a bit of work lately poking at crimson with the "alienized" bluestore implementation using fio+librbd against a single NVMe-backed OSD running on localhost.  Crimson isn't as fast as the classic OSD for large parallel workloads because it only has a single reactor thread right now.  We'll need multiple reactors before we can match classic at high queue depths.  At lower queue depths crimson is actually faster though, despite not even using a seastar-native objectstore yet.
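To illustrate the "only as fast as the slowest replica write" point, here is a minimal sketch. It assumes a made-up latency distribution (a 200 us floor plus an exponential tail with a 50 us mean) and ignores network hops and the primary's own processing, so the numbers are illustrative only and are not measurements from any cluster. The function names are hypothetical.

```python
import random


def ack_latency_us(replica_latencies_us):
    """Latency the client sees when the ack waits for every replica write."""
    return max(replica_latencies_us)


def mean_ack_latency_us(replicas, samples=10_000, seed=0):
    """Average client-visible latency over many simulated QD=1 writes."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0.0
    for _ in range(samples):
        # Hypothetical per-replica latency: 200 us floor plus an
        # exponential tail with a 50 us mean. This is NOT measured data.
        latencies = [200.0 + rng.expovariate(1 / 50.0) for _ in range(replicas)]
        total += ack_latency_us(latencies)
    return total / samples


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"replicas={n}: mean ack latency ~ {mean_ack_latency_us(n):.0f} us")
```

Even though every replica draws from the same distribution, the mean ack latency rises with the replica count, because each write pays for the worst of its replicas. At QD=1 there is no other IO in flight to hide that behind.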

For fun I just did a quick 30s QD1 test on our newer test hardware that Intel donated for the community lab (Xeon Platinum CPUs, P4510 NVMe drives, etc.).  This is "almost master" with a couple of additional crimson PRs, using a single 16GB pre-allocated RBD volume and all of the system-level optimizations you can imagine (no C/P-state transitions, ceph-osd pinned to a specific set of cores, fio on localhost, no replication, etc.), so it's pretty much best case in terms of squeezing out performance.  First, here's the classic OSD with bluestore on master:


[nhm@o01 initial]$ sudo /home/nhm/src/fio/fio --ioengine=rbd --direct=1 --bs=4096B --iodepth=1 --end_fsync=1 --rw=randwrite --norandommap --size=16384M --numjobs=1 --runtime=30 --time_based --clientname=admin --pool=cbt-librbd --rbdname=`hostname -f`-0 --invalidate=0 --name=cbt-librbd/`hostname -f`-0-0
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1

Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=14.5MiB/s][w=3723 IOPS][eta 00m:00s]
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (groupid=0, jobs=1): err= 0: pid=2085951: Sun Feb  7 12:55:19 2021
  write: IOPS=3804, BW=14.9MiB/s (15.6MB/s)(446MiB/30001msec); 0 zone resets
    slat (usec): min=2, max=474, avg= 6.85, stdev= 2.73
    clat (usec): min=187, max=2658, avg=255.05, stdev=81.21
     lat (usec): min=192, max=2704, avg=261.90, stdev=81.75
    clat percentiles (usec):
     |  1.00th=[  200],  5.00th=[  204], 10.00th=[  210], 20.00th=[  219],
     | 30.00th=[  225], 40.00th=[  231], 50.00th=[  235], 60.00th=[  239],
     | 70.00th=[  245], 80.00th=[  253], 90.00th=[  293], 95.00th=[  502],
     | 99.00th=[  570], 99.50th=[  611], 99.90th=[  963], 99.95th=[  988],
     | 99.99th=[ 1401]
   bw (  KiB/s): min=14704, max=16104, per=100.00%, avg=15234.44, stdev=260.42, samples=59
   iops        : min= 3676, max= 4026, avg=3808.61, stdev=65.10, samples=59
  lat (usec)   : 250=76.75%, 500=18.23%, 750=4.86%, 1000=0.12%
  lat (msec)   : 2=0.03%, 4=0.01%
  cpu          : usr=3.12%, sys=2.55%, ctx=114166, majf=0, minf=184
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,114148,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=14.9MiB/s (15.6MB/s), 14.9MiB/s-14.9MiB/s (15.6MB/s-15.6MB/s), io=446MiB (468MB), run=30001-30001msec

Disk stats (read/write):
    dm-2: ios=16/103, merge=0/0, ticks=1/24, in_queue=25, util=0.03%, aggrios=31/426, aggrmerge=0/115, aggrticks=4/68, aggrin_queue=76, aggrutil=0.30%
  sda: ios=31/426, merge=0/115, ticks=4/68, in_queue=76, util=0.30%



IMHO this is about as high as we can get right now on classic with everything stacked in our favor.  Latency increases quickly once you involve remote network clients, multiple OSDs, replication, etc.  Here's the same test with crimson using alienized bluestore:



[nhm@o01 initial]$ sudo /home/nhm/src/fio/fio --ioengine=rbd --direct=1 --bs=4096B --iodepth=1 --end_fsync=1 --rw=randwrite --norandommap --size=16384M --numjobs=1 --runtime=30 --time_based --clientname=admin --pool=cbt-librbd --rbdname=`hostname -f`-0 --invalidate=0 --name=cbt-librbd/`hostname -f`-0-0
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1

Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=22.0MiB/s][w=5886 IOPS][eta 00m:00s]
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (groupid=0, jobs=1): err= 0: pid=2075392: Sun Feb  7 12:44:32 2021
  write: IOPS=5647, BW=22.1MiB/s (23.1MB/s)(662MiB/30001msec); 0 zone resets
    slat (usec): min=2, max=527, avg= 2.93, stdev= 1.34
    clat (usec): min=138, max=41367, avg=173.81, stdev=232.51
     lat (usec): min=141, max=41370, avg=176.74, stdev=232.52
    clat percentiles (usec):
     |  1.00th=[  153],  5.00th=[  155], 10.00th=[  155], 20.00th=[  157],
     | 30.00th=[  159], 40.00th=[  161], 50.00th=[  163], 60.00th=[  163],
     | 70.00th=[  165], 80.00th=[  169], 90.00th=[  176], 95.00th=[  210],
     | 99.00th=[  644], 99.50th=[  725], 99.90th=[  832], 99.95th=[  881],
     | 99.99th=[ 1221]
   bw (  KiB/s): min=19024, max=23792, per=100.00%, avg=22632.32, stdev=964.77, samples=59
   iops        : min= 4756, max= 5948, avg=5658.07, stdev=241.20, samples=59
  lat (usec)   : 250=97.23%, 500=1.64%, 750=0.77%, 1000=0.33%
  lat (msec)   : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=1.91%, sys=1.78%, ctx=169457, majf=1, minf=336
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,169444,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=22.1MiB/s (23.1MB/s), 22.1MiB/s-22.1MiB/s (23.1MB/s-23.1MB/s), io=662MiB (694MB), run=30001-30001msec

Disk stats (read/write):
    dm-2: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=1/308, aggrmerge=0/11, aggrticks=0/47, aggrin_queue=51, aggrutil=0.27%
  sda: ios=1/308, merge=0/11, ticks=0/47, in_queue=51, util=0.27%


About 50% faster with lower latency, and we haven't even done much to optimize it yet.  It's not as fast as your ZFS results below, but at least for a single OSD without replication I'm glad to see crimson is getting us close to being in the same ballpark. I will note that the crimson-osd process was using 23GB (!) in this test so it's still very alpha code (and take the test result with a grain of salt since we're only doing minimal QA right now).  At least we have a target to maintain and hopefully improve though as we continue to work toward stabilizing it.
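For reference, the relative difference between the two runs can be worked out from the averages fio reported above. This is a small sketch of the arithmetic only; the dictionary layout and the helper name are illustrative, and the figures are copied from the "write:" and "lat (usec)" lines of the two fio outputs.

```python
# Averages copied from the two fio runs above (classic first, then crimson).
classic = {"iops": 3804, "lat_us": 261.90}
crimson = {"iops": 5647, "lat_us": 176.74}


def percent_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


iops_gain = percent_change(crimson["iops"], classic["iops"])
lat_drop = -percent_change(crimson["lat_us"], classic["lat_us"])

print(f"IOPS gain with crimson:     {iops_gain:.1f}%")
print(f"Average latency reduction:  {lat_drop:.1f}%")
```

That works out to roughly 48% more IOPS and about a third less average latency for this single-OSD, no-replication setup.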


Mark


On 2/5/21 7:51 AM, Wido den Hollander wrote:
(Sending it to dev list as people might know it there)

Hi,

There are many talks and presentations out there about Ceph's
performance. Ceph is great when it comes to parallel I/O, large queue
depths and many applications sending I/O towards Ceph.

One area where Ceph isn't the fastest is 4k blocks written at queue
depth 1.

Some applications benefit greatly from high-performance, low-latency
I/O at qd=1, for example single-threaded applications writing small
files inside a VM running on RBD.

With some tuning you can get to ~700us latency for a 4k write at qd=1
(replication, size=3).

I benchmark this using fio:

$ fio --ioengine=rbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..

A 700us latency works out to roughly 1400 IOps (1000 / 0.7 ≈ 1430).

Compared to, say, a BSD machine running ZFS, that's on the low side.
With ZFS+NVMe you can reach somewhere between 7,000 and 10,000 IOps;
the latency is simply much lower.
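The latency/IOps relationship used above only holds because exactly one I/O is in flight at a time, so each write has to finish before the next one starts. A minimal sketch of that conversion in both directions (the function names are made up for illustration):

```python
def iops_at_qd1(latency_us):
    """IOPS achievable with exactly one I/O in flight at a time."""
    return 1_000_000 / latency_us


def latency_us_for_iops(iops):
    """Per-I/O latency needed to sustain a given IOPS rate at QD=1."""
    return 1_000_000 / iops


if __name__ == "__main__":
    # Ceph figure quoted above: ~700 us per 4k write with size=3.
    print(f"700 us per write -> {iops_at_qd1(700):.0f} IOPS")
    # ZFS+NVMe range quoted above: 7,000 to 10,000 IOPS.
    for target in (7_000, 10_000):
        print(f"{target} IOPS needs {latency_us_for_iops(target):.1f} us per write")
```

Put differently, matching the ZFS numbers at qd=1 would mean bringing the per-write latency down to somewhere between about 143us and 100us.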

My benchmarking / test setup for this:

- Ceph Nautilus/Octopus (doesn't make a big difference)
- 3x SuperMicro 1U with:
- AMD Epyc 7302P 16-core CPU
- 128GB DDR4
- 10x Samsung PM983 3.84TB
- 10Gbit Base-T networking

Things to configure/tune:

- C-State pinning to 1
- CPU governor set to performance
- Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)

Higher clock speeds (New AMD Epyc coming in March!) help to reduce the
latency and going towards 25Gbit/100Gbit might help as well.

These are, however, only small increments that might reduce the
latency by another 15% or so.

It doesn't bring us anywhere near the 10k IOps other applications can do.

And I totally understand that replication over a TCP/IP network takes
time and thus increases latency.

The Crimson project [0] aims to lower the latency using things like
DPDK and SPDK, but it is far from finished and not production-ready.

In the meantime, am I overlooking something here? Can we further
reduce the latency of the current OSDs?

Reaching a ~500us latency would already be great!

Thanks,

Wido


[0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



Hi Mark/Wido

Yes, this is definitely an important use case: getting high performance without a very large number of parallel clients/ops (relative to the number of OSDs). I am happy to hear about the progress on Crimson and can't wait to see it mature. One thing: you mention adding more reactor threads to Crimson in the future to match classic at high queue depths. That sounds great, but in case it leads to locks being added that hurt the single-queue-depth case, even slightly, I would recommend an OSD config option that enables a single-threaded mode, in case that gives the best performance at queue depth 1.

I ran some QD=1 tests on a test Octopus/Bluestore system using 1 OSD on a RAM disk, only 1 replica, and clients on localhost, which gives approx. 0.28 ms latency:

rbd bench --io-type write rbd/image-01 --io-threads=1 --io-size 4K  --io-pattern rand --rbd_cache=false

  SEC       OPS   OPS/SEC   BYTES/SEC
    1      3584   3599.33    14 MiB/s
    2      7241   3628.19    14 MiB/s
    3     10833   3616.09    14 MiB/s

fio -ioengine=rbd --name=xx --pool=rbd --rbdname=image-01 --iodepth=1 --rw=randwrite --bs=4k --direct=1 --runtime=10 --time_based

Run status group 0 (all jobs):
  WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=128MiB (135MB), run=10001-10001msec
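The per-write latency follows from the throughput figures above, since only one 4k write is in flight at a time. A small sketch of the conversion, using the last rbd bench sample and the fio bandwidth line (the helper names are illustrative):

```python
def iops_from_bandwidth(mib_per_sec, block_kib=4):
    """Convert a bandwidth figure to IOPS for a fixed block size."""
    return mib_per_sec * 1024 / block_kib


def latency_us_from_ops(ops_per_sec):
    """Per-op latency at QD=1, where ops are strictly back to back."""
    return 1_000_000 / ops_per_sec


# rbd bench: last OPS/SEC sample above.
rbd_bench_lat = latency_us_from_ops(3616.09)

# fio: 12.8 MiB/s of 4 KiB random writes.
fio_iops = iops_from_bandwidth(12.8)
fio_lat = latency_us_from_ops(fio_iops)

print(f"rbd bench: {rbd_bench_lat:.0f} us per write")
print(f"fio:       {fio_iops:.0f} IOPS, {fio_lat:.0f} us per write")
```

The rbd bench figure matches the approx. 0.28 ms quoted above; the fio run comes out slightly slower, at about 0.31 ms per write.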

I think the Ceph msgr contributes a large part of the latency:

ceph_perf_msgr_server 127.0.0.1:9000 64 10
ceph_perf_msgr_client 127.0.0.1:9000 1 1 1000  10 4096
103154 us (count = 1000)  -> 103 us (count = 1)
 
This 0.1 ms latency is quite a large overhead for a simple msgr echo test, especially when compared to the latency tests below:

TCP client/server socket echo test with epoll wait (the same mechanism msgr uses to wait for events), 4k block size
https://github.com/onestraw/epoll-example  
Sent 1000000 messages, avg latency 7.117855 us

tcp latency
qperf 127.0.0.1  tcp_lat  
latency  =  6.04 us

ping latency/rtt
ping -c1 -q -W1 127.0.0.1  
rtt min/avg/max/mdev = 0.018/0.018/0.018/0.000 ms
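To put the numbers above side by side, here is a rough sketch of the comparison. It assumes that a single QD=1 write in the localhost test costs one msgr round trip, which is a simplification and not something the measurements themselves establish. All values are copied from the results above; the variable names are illustrative.

```python
# ceph_perf_msgr_client result: 1000 echoes took 103154 us in total.
msgr_total_us = 103154
msgr_count = 1000
msgr_echo_us = msgr_total_us / msgr_count

# Plain TCP echo with epoll, and qperf tcp_lat, both on 127.0.0.1.
tcp_epoll_echo_us = 7.117855
qperf_tcp_lat_us = 6.04

# Approximate end-to-end latency of one 4k write in the test above (0.28 ms).
write_latency_us = 280.0

overhead_vs_tcp = msgr_echo_us / tcp_epoll_echo_us
# ASSUMPTION: one msgr round trip per write; real request paths may differ.
share_of_write = msgr_echo_us / write_latency_us

print(f"msgr echo:            {msgr_echo_us:.1f} us")
print(f"vs. raw TCP echo:     {overhead_vs_tcp:.1f}x slower")
print(f"share of a 4k write:  {share_of_write:.0%}")
```

Under that assumption, the msgr echo alone is about 14x slower than the raw epoll echo and accounts for more than a third of the 0.28 ms write latency.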

/Maged


  

