On 2/7/21 2:43 PM, Mark Nelson wrote:
> Hi Wido,
>
>
> Long time no see! Yeah, this kind of use case is painful for Ceph. It
> really hurts having to send off synchronous replica writes to other OSDs
> and having to wait to ack the client after *all* have completed. We're
> only as fast as the slowest replica write for each IO. We can't
> parallelize things at all since it's literally a single client doing a
> single IO at a time. I don't doubt your ZFS results, it sounds about
> right given the difference in the IO path. FWIW, I've been doing quite
> a bit of work lately poking at crimson with the "alienized" bluestore
> implementation using fio+librbd against a single NVMe backed OSD running
> on localhost. We're not as fast as the classic OSD for large parallel
> workloads because we only have a single reactor thread right now. We'll
> need multiple reactors before we can match classic at high queue
> depths. At lower queue depths crimson is actually faster though,
> despite not even using a seastar native objectstore yet.

So the conclusion here is that, right now, the ~1600 IOps I'm seeing
with size=3 over a 10GbE Ethernet network is about the best you can get
at the moment.

Things that might improve it a bit:

- 25GbE (slightly lower latency)
- Higher-clocked CPUs for the OSDs

That might bring me to ~1800 IOps, but that's about it, right?
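Back-of-the-envelope, since at qd=1 every write is fully serialized:
IOps = 1,000,000 / latency in us. 1600 IOps corresponds to ~625us per
write, and reaching 1800 IOps means getting down to ~555us, so the
faster network and CPUs together would need to shave off roughly 70us.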
> For fun I just did a quick 30s QD1 test on our newer test hardware that
> Intel donated for the community lab (Xeon Platinum CPUs, P4510 NVMe
> drives, etc). This is "almost master" with a couple of additional
> crimson PRs, using a single 16GB pre-allocated RBD volume and all of the
> system-level optimizations you can imagine (no c/p state transitions,
> ceph-osd pinned to a specific set of cores, fio on localhost, no
> replication, etc), so it's pretty best case in terms of squeezing out
> performance.
>
> First, here's the classic OSD with bluestore on master:
>
>
> [nhm@o01 initial]$ sudo /home/nhm/src/fio/fio --ioengine=rbd --direct=1
> --bs=4096B --iodepth=1 --end_fsync=1 --rw=randwrite --norandommap
> --size=16384M --numjobs=1 --runtime=30 --time_based --clientname=admin
> --pool=cbt-librbd --rbdname=`hostname -f`-0 --invalidate=0
> --name=cbt-librbd/`hostname -f`-0-0
> cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (g=0): rw=randwrite,
> bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd,
> iodepth=1
>
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][w=14.5MiB/s][w=3723 IOPS][eta 00m:00s]
> cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (groupid=0, jobs=1): err= 0:
> pid=2085951: Sun Feb 7 12:55:19 2021
>   write: IOPS=3804, BW=14.9MiB/s (15.6MB/s)(446MiB/30001msec); 0 zone resets
>     slat (usec): min=2, max=474, avg= 6.85, stdev= 2.73
>     clat (usec): min=187, max=2658, avg=255.05, stdev=81.21
>      lat (usec): min=192, max=2704, avg=261.90, stdev=81.75
>     clat percentiles (usec):
>      |  1.00th=[  200],  5.00th=[  204], 10.00th=[  210], 20.00th=[  219],
>      | 30.00th=[  225], 40.00th=[  231], 50.00th=[  235], 60.00th=[  239],
>      | 70.00th=[  245], 80.00th=[  253], 90.00th=[  293], 95.00th=[  502],
>      | 99.00th=[  570], 99.50th=[  611], 99.90th=[  963], 99.95th=[  988],
>      | 99.99th=[ 1401]
>    bw (  KiB/s): min=14704, max=16104, per=100.00%, avg=15234.44,
>      stdev=260.42, samples=59
>    iops        : min= 3676, max= 4026, avg=3808.61, stdev=65.10, samples=59
>   lat (usec)   : 250=76.75%, 500=18.23%, 750=4.86%, 1000=0.12%
>   lat (msec)   : 2=0.03%, 4=0.01%
>   cpu          : usr=3.12%, sys=2.55%, ctx=114166, majf=0, minf=184
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,114148,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   WRITE: bw=14.9MiB/s (15.6MB/s), 14.9MiB/s-14.9MiB/s (15.6MB/s-15.6MB/s),
>     io=446MiB (468MB), run=30001-30001msec
>
> Disk stats (read/write):
>     dm-2: ios=16/103, merge=0/0, ticks=1/24, in_queue=25, util=0.03%,
>     aggrios=31/426, aggrmerge=0/115, aggrticks=4/68, aggrin_queue=76,
>     aggrutil=0.30%
>   sda: ios=31/426, merge=0/115, ticks=4/68, in_queue=76, util=0.30%
>
>
> IMHO this is about as high as we can get right now on classic with
> everything stacked in our favor. Latency increases quickly once you
> involve remote network clients, multiple OSDs, replication, etc.
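As an aside, for anyone wanting to reproduce the "no c/p state
transitions, ceph-osd pinned to cores" part of such a setup: it usually
boils down to something like the lines below. This is only a sketch;
exact flags differ per distro and kernel, I don't know Mark's exact
invocations, the core list is just an example, and it assumes a single
ceph-osd process on the box.

$ sudo cpupower idle-set -D 0                 # disable all C-states deeper than poll
$ sudo cpupower frequency-set -g performance  # pin the frequency governor
$ sudo taskset -cp 0-7 $(pidof ceph-osd)      # pin the OSD to cores 0-7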
> Here's the same test with crimson using alienized bluestore:
>
>
> [nhm@o01 initial]$ sudo /home/nhm/src/fio/fio --ioengine=rbd --direct=1
> --bs=4096B --iodepth=1 --end_fsync=1 --rw=randwrite --norandommap
> --size=16384M --numjobs=1 --runtime=30 --time_based --clientname=admin
> --pool=cbt-librbd --rbdname=`hostname -f`-0 --invalidate=0
> --name=cbt-librbd/`hostname -f`-0-0
> cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (g=0): rw=randwrite,
> bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd,
> iodepth=1
>
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][w=22.0MiB/s][w=5886 IOPS][eta 00m:00s]
> cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (groupid=0, jobs=1): err= 0:
> pid=2075392: Sun Feb 7 12:44:32 2021
>   write: IOPS=5647, BW=22.1MiB/s (23.1MB/s)(662MiB/30001msec); 0 zone resets
>     slat (usec): min=2, max=527, avg= 2.93, stdev= 1.34
>     clat (usec): min=138, max=41367, avg=173.81, stdev=232.51
>      lat (usec): min=141, max=41370, avg=176.74, stdev=232.52
>     clat percentiles (usec):
>      |  1.00th=[  153],  5.00th=[  155], 10.00th=[  155], 20.00th=[  157],
>      | 30.00th=[  159], 40.00th=[  161], 50.00th=[  163], 60.00th=[  163],
>      | 70.00th=[  165], 80.00th=[  169], 90.00th=[  176], 95.00th=[  210],
>      | 99.00th=[  644], 99.50th=[  725], 99.90th=[  832], 99.95th=[  881],
>      | 99.99th=[ 1221]
>    bw (  KiB/s): min=19024, max=23792, per=100.00%, avg=22632.32,
>      stdev=964.77, samples=59
>    iops        : min= 4756, max= 5948, avg=5658.07, stdev=241.20, samples=59
>   lat (usec)   : 250=97.23%, 500=1.64%, 750=0.77%, 1000=0.33%
>   lat (msec)   : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
>   cpu          : usr=1.91%, sys=1.78%, ctx=169457, majf=1, minf=336
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,169444,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   WRITE: bw=22.1MiB/s (23.1MB/s), 22.1MiB/s-22.1MiB/s (23.1MB/s-23.1MB/s),
>     io=662MiB (694MB), run=30001-30001msec
>
> Disk stats (read/write):
>     dm-2: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
>     aggrios=1/308, aggrmerge=0/11, aggrticks=0/47, aggrin_queue=51,
>     aggrutil=0.27%
>   sda: ios=1/308, merge=0/11, ticks=0/47, in_queue=51, util=0.27%
>
>
> About 50% faster with lower latency, and we haven't even done much to
> optimize it yet. It's not as fast as your ZFS results below, but at
> least for a single OSD without replication I'm glad to see crimson is
> getting us close to being in the same ballpark. I will note that the
> crimson-osd process was using 23GB (!) in this test, so it's still very
> alpha code (and take the test result with a grain of salt since we're
> only doing minimal QA right now). At least we have a target to maintain
> and hopefully improve as we continue to work toward stabilizing it.
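The numbers check out: average total latency went from 261.90us
(classic) to 176.74us (crimson), and 1,000,000 / 261.9 ≈ 3800 while
1,000,000 / 176.7 ≈ 5650, which matches the IOPS fio reports; a ~48%
gain, presumably mostly from the OSD's internal IO path since both
runs are local, single-OSD and unreplicated.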
Crimson is still pretty far away and thus not something that can be
used today. Really good to see this, though! QD=1 performance is still
very important for many applications, and it also shapes how users
experience Ceph.

Wido

>
> Mark
>
>
> On 2/5/21 7:51 AM, Wido den Hollander wrote:
>> (Sending it to the dev list as people might know it there)
>>
>> Hi,
>>
>> There are many talks and presentations out there about Ceph's
>> performance. Ceph is great when it comes to parallel I/O, large queue
>> depths and many applications sending I/O towards Ceph.
>>
>> One thing where Ceph isn't the fastest is 4k blocks written at Queue
>> Depth 1.
>>
>> Some applications benefit very much from high-performance, low-latency
>> I/O at qd=1, for example single-threaded applications writing small
>> files inside a VM running on RBD.
>>
>> With some tuning you can get to a ~700us latency for a 4k write with
>> qd=1 (replication, size=3).
>>
>> I benchmark this using fio:
>>
>> $ fio --ioengine=rbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
>>
>> 700us latency means the result will be about ~1400 IOps (1000 / 0.7).
>>
>> Compared to, say, a BSD machine running ZFS, that's on the low side.
>> With ZFS+NVMe you'll be able to reach somewhere between 7,000 and
>> 10,000 IOps; the latency is simply much lower.
>>
>> My benchmarking / test setup for this:
>>
>> - Ceph Nautilus/Octopus (doesn't make a big difference)
>> - 3x SuperMicro 1U with:
>>   - AMD Epyc 7302P 16-core CPU
>>   - 128GB DDR4
>>   - 10x Samsung PM983 3.84TB
>>   - 10Gbit Base-T networking
>>
>> Things to configure/tune:
>>
>> - C-State pinning to 1
>> - CPU governor set to performance
>> - Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
>>
>> Higher clock speeds (new AMD Epyc coming in March!) help to reduce the
>> latency, and going towards 25Gbit/100Gbit might help as well.
>>
>> These are, however, only small increments that might reduce the
>> latency by another 15% or so.
>>
>> It doesn't bring us anywhere near the 10k IOps other applications can do.
>>
>> And I totally understand that replication over a TCP/IP network takes
>> time and thus increases latency.
>>
>> The Crimson project [0] is aiming to lower the latency with things
>> like DPDK and SPDK, but it is far from finished and production-ready.
>>
>> In the meantime, am I overlooking something here? Can we reduce the
>> latency of the current OSDs any further?
>>
>> Reaching a ~500us latency would already be great!
>>
>> Thanks,
>>
>> Wido
>>
>>
>> [0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
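PS: for completeness, the "turn off all logging" item in my tuning list
above translates to something like this in ceph.conf. A sketch: the 0/0
form sets both the log level and the in-memory gather level to 0, and
there are more debug_* subsystems you can zero out the same way.

[osd]
debug_osd = 0/0
debug_ms = 0/0
debug_bluestore = 0/0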