Re: Increasing QD=1 performance (lowering latency)

Hi Wido,


Long time no see!  Yeah, this kind of use case is painful for ceph.  It really hurts having to send off synchronous replica writes to other OSDs and having to wait to ack the client after *all* have completed.  We're only as fast as the slowest replica write for each IO.  We can't parallelize things at all since it's literally a single client doing a single IO at a time.  I don't doubt your ZFS results; it sounds about right given the difference in the IO path.  FWIW, I've been doing quite a bit of work lately poking at crimson with the "alienized" bluestore implementation using fio+librbd against a single NVMe backed OSD running on localhost.  We're not as fast as the classic OSD for large parallel workloads because we only have a single reactor thread right now.  We'll need multiple reactors before we can match classic at high queue depths.  At lower queue depths crimson is actually faster though, despite not even using a seastar native objectstore yet.
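
To put very rough, purely illustrative numbers on that: at QD=1 with size=3 the client-visible write latency is roughly

  client<->primary network + max(primary commit, replica network + replica commit) + ack overhead

so even when a single-OSD, no-replication write on localhost comes in around ~250us (as in the classic numbers below), adding real network hops and waiting on the slowest of the remote commits gets you into the ~700us territory you're describing pretty quickly.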

For fun I just did a quick 30s QD1 test on our newer test hardware that Intel donated for the community lab (Xeon Platinum CPUs, P4510 NVMe drives, etc).  This is "almost master" with a couple of additional crimson PRs, using a single 16GB pre-allocated RBD volume and all of the system-level optimizations you can imagine (no c/p state transitions, ceph-osd pinned to a specific set of cores, fio on localhost, no replication, etc), so it's pretty much a best case in terms of squeezing out performance.  First, here's the classic OSD with bluestore on master:


[nhm@o01 initial]$ sudo /home/nhm/src/fio/fio --ioengine=rbd --direct=1 --bs=4096B --iodepth=1 --end_fsync=1 --rw=randwrite --norandommap --size=16384M --numjobs=1 --runtime=30 --time_based --clientname=admin --pool=cbt-librbd --rbdname=`hostname -f`-0 --invalidate=0 --name=cbt-librbd/`hostname -f`-0-0
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1

Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=14.5MiB/s][w=3723 IOPS][eta 00m:00s]
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (groupid=0, jobs=1): err= 0: pid=2085951: Sun Feb  7 12:55:19 2021
  write: IOPS=3804, BW=14.9MiB/s (15.6MB/s)(446MiB/30001msec); 0 zone resets
    slat (usec): min=2, max=474, avg= 6.85, stdev= 2.73
    clat (usec): min=187, max=2658, avg=255.05, stdev=81.21
     lat (usec): min=192, max=2704, avg=261.90, stdev=81.75
    clat percentiles (usec):
     |  1.00th=[  200],  5.00th=[  204], 10.00th=[  210], 20.00th=[  219],
     | 30.00th=[  225], 40.00th=[  231], 50.00th=[  235], 60.00th=[  239],
     | 70.00th=[  245], 80.00th=[  253], 90.00th=[  293], 95.00th=[  502],
     | 99.00th=[  570], 99.50th=[  611], 99.90th=[  963], 99.95th=[  988],
     | 99.99th=[ 1401]
   bw (  KiB/s): min=14704, max=16104, per=100.00%, avg=15234.44, stdev=260.42, samples=59
   iops        : min= 3676, max= 4026, avg=3808.61, stdev=65.10, samples=59
  lat (usec)   : 250=76.75%, 500=18.23%, 750=4.86%, 1000=0.12%
  lat (msec)   : 2=0.03%, 4=0.01%
  cpu          : usr=3.12%, sys=2.55%, ctx=114166, majf=0, minf=184
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,114148,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=14.9MiB/s (15.6MB/s), 14.9MiB/s-14.9MiB/s (15.6MB/s-15.6MB/s), io=446MiB (468MB), run=30001-30001msec

Disk stats (read/write):
    dm-2: ios=16/103, merge=0/0, ticks=1/24, in_queue=25, util=0.03%, aggrios=31/426, aggrmerge=0/115, aggrticks=4/68, aggrin_queue=76, aggrutil=0.30%
  sda: ios=31/426, merge=0/115, ticks=4/68, in_queue=76, util=0.30%



IMHO this is about as high as we can get right now on classic with everything stacked in our favor.  Latency increases quickly once you involve remote network clients, multiple OSDs, replication, etc.  Here's the same test with crimson using alienized bluestore:



[nhm@o01 initial]$ sudo /home/nhm/src/fio/fio --ioengine=rbd --direct=1 --bs=4096B --iodepth=1 --end_fsync=1 --rw=randwrite --norandommap --size=16384M --numjobs=1 --runtime=30 --time_based --clientname=admin --pool=cbt-librbd --rbdname=`hostname -f`-0 --invalidate=0 --name=cbt-librbd/`hostname -f`-0-0
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1

Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=22.0MiB/s][w=5886 IOPS][eta 00m:00s]
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (groupid=0, jobs=1): err= 0: pid=2075392: Sun Feb  7 12:44:32 2021
  write: IOPS=5647, BW=22.1MiB/s (23.1MB/s)(662MiB/30001msec); 0 zone resets
    slat (usec): min=2, max=527, avg= 2.93, stdev= 1.34
    clat (usec): min=138, max=41367, avg=173.81, stdev=232.51
     lat (usec): min=141, max=41370, avg=176.74, stdev=232.52
    clat percentiles (usec):
     |  1.00th=[  153],  5.00th=[  155], 10.00th=[  155], 20.00th=[  157],
     | 30.00th=[  159], 40.00th=[  161], 50.00th=[  163], 60.00th=[  163],
     | 70.00th=[  165], 80.00th=[  169], 90.00th=[  176], 95.00th=[  210],
     | 99.00th=[  644], 99.50th=[  725], 99.90th=[  832], 99.95th=[  881],
     | 99.99th=[ 1221]
   bw (  KiB/s): min=19024, max=23792, per=100.00%, avg=22632.32, stdev=964.77, samples=59
   iops        : min= 4756, max= 5948, avg=5658.07, stdev=241.20, samples=59
  lat (usec)   : 250=97.23%, 500=1.64%, 750=0.77%, 1000=0.33%
  lat (msec)   : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=1.91%, sys=1.78%, ctx=169457, majf=1, minf=336
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,169444,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=22.1MiB/s (23.1MB/s), 22.1MiB/s-22.1MiB/s (23.1MB/s-23.1MB/s), io=662MiB (694MB), run=30001-30001msec

Disk stats (read/write):
    dm-2: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=1/308, aggrmerge=0/11, aggrticks=0/47, aggrin_queue=51, aggrutil=0.27%
  sda: ios=1/308, merge=0/11, ticks=0/47, in_queue=51, util=0.27%


About 50% faster with lower latency, and we haven't even done much to optimize it yet.  It's not as fast as your ZFS results below, but at least for a single OSD without replication I'm glad to see crimson is getting us close to being in the same ballpark.  I will note that the crimson-osd process was using 23GB (!) in this test, so it's still very much alpha code (and take the result with a grain of salt since we're only doing minimal QA right now).  At least we now have a target to maintain, and hopefully improve, as we continue working toward stabilizing it.


Mark


On 2/5/21 7:51 AM, Wido den Hollander wrote:
(Sending this to the dev list as people there might know)

Hi,

There are many talks and presentations out there about Ceph's
performance. Ceph is great when it comes to parallel I/O, large queue
depths and many applications sending I/O towards Ceph.

One thing where Ceph isn't the fastest is 4k blocks written at Queue
Depth 1.

Some applications benefit very much from high-performance / low-latency
I/O at qd=1, for example single-threaded applications writing small
files inside a VM running on RBD.

With some tuning you can get to ~700us latency for a 4k write with
qd=1 (replication, size=3).

I benchmark this using fio:

$ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
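
The full invocation looks roughly like this (pool and image names are
placeholders; exact options depend on your fio build):

fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test-image \
    --direct=1 --bs=4k --iodepth=1 --rw=randwrite --numjobs=1 \
    --runtime=30 --time_based --name=qd1-test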

700us latency means the result will be about 1500 IOps (1000 / 0.7).

Compared to, let's say, a BSD machine running ZFS, that's on the low
side. With ZFS+NVMe you'll be able to reach somewhere between 7,000
and 10,000 IOps; the latency is simply much lower.

My benchmarking / test setup for this:

- Ceph Nautilus/Octopus (doesn't make a big difference)
- 3x SuperMicro 1U with:
- AMD Epyc 7302P 16-core CPU
- 128GB DDR4
- 10x Samsung PM983 3.84TB
- 10Gbit Base-T networking

Things to configure/tune:

- C-State pinning to 1
- CPU governor to performance
- Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
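
Roughly, those settings map onto commands like the following. This is a
sketch from memory; exact commands and option values differ per distro
and Ceph release, so treat it as an illustration rather than a recipe:

# CPU frequency governor to performance
cpupower frequency-set -g performance

# keep the cores out of deep C-states (or boot with
# intel_idle.max_cstate=1 processor.max_cstate=1)
cpupower idle-set -D 0

# turn off the debug logging at runtime
ceph config set osd debug_osd 0/0
ceph config set osd debug_ms 0/0
ceph config set osd debug_bluestore 0/0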

Higher clock speeds (new AMD Epyc coming in March!) help to reduce the
latency, and going towards 25Gbit/100Gbit might help as well.

These are, however, only very small increments and might help to reduce
the latency by another 15% or so.

It doesn't bring us anywhere near the 10k IOps other applications can do.

And I totally understand that replication over a TCP/IP network takes
time and thus increases latency.

The Crimson project [0] is aiming to lower the latency with things
like DPDK and SPDK, but this is far from finished and production-ready.

In the meantime, am I overlooking something here? Can we reduce the
latency of the current OSDs further?

Reaching a ~500us latency would already be great!

Thanks,

Wido


[0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
