On 2/7/21 2:43 PM, Mark Nelson wrote:
> Hi Wido,
>
>
> Long time no see! Yeah, this kind of use case is painful for Ceph. It
> really hurts having to send off synchronous replica writes to other OSDs
> and having to wait to ack the client after *all* have completed. We're
> only as fast as the slowest replica write for each IO. We can't
> parallelize things at all since it's literally a single client doing a
> single IO at a time. I don't doubt your ZFS results, it sounds about
> right given the difference in the IO path. FWIW, I've been doing quite
> a bit of work lately poking at crimson with the "alienized" bluestore
> implementation using fio+librbd against a single NVMe backed OSD running
> on localhost. We're not as fast as the classic OSD for large parallel
> workloads because we only have a single reactor thread right now. We'll
> need multiple reactors before we can match classic at high queue
> depths. At lower queue depths crimson is actually faster though,
> despite not even using a seastar native objectstore yet.

So the conclusion here is that, right now, the ~1600 IOps I'm seeing
with size=3 over a 10GbE Ethernet network is about the best you can get
at the moment.

Things that might improve it a bit:

- 25GbE (slightly lower latency)
- Higher-clocked CPUs for the OSDs

That might bring me to ~1800 IOps, but that's about it, right?
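Back-of-the-envelope, since at qd=1 every write is fully serialized:
IOps = 1,000,000 / latency in us. 1600 IOps corresponds to ~625us per
write, and reaching 1800 IOps means getting down to ~555us, so the
faster network and CPUs together would need to shave off roughly 70us.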
> For fun I just did a quick 30s QD1 test on our newer test hardware that
> Intel donated for the community lab (Xeon Platinum CPUs, P4510 NVMe
> drives, etc). This is "almost master" with a couple of additional
> crimson PRs, using a single 16GB pre-allocated RBD volume and all of the
> system-level optimizations you can imagine (no c/p state transitions,
> ceph-osd pinned to a specific set of cores, fio on localhost, no
> replication, etc), so it's pretty best case in terms of squeezing out
> performance.
>
> First, here's the classic OSD with bluestore on master:
>
>
> [nhm@o01 initial]$ sudo /home/nhm/src/fio/fio --ioengine=rbd --direct=1
> --bs=4096B --iodepth=1 --end_fsync=1 --rw=randwrite --norandommap
> --size=16384M --numjobs=1 --runtime=30 --time_based --clientname=admin
> --pool=cbt-librbd --rbdname=`hostname -f`-0 --invalidate=0
> --name=cbt-librbd/`hostname -f`-0-0
> cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (g=0): rw=randwrite,
> bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd,
> iodepth=1
>
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][w=14.5MiB/s][w=3723 IOPS][eta 00m:00s]
> cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (groupid=0, jobs=1): err= 0:
> pid=2085951: Sun Feb 7 12:55:19 2021
>   write: IOPS=3804, BW=14.9MiB/s (15.6MB/s)(446MiB/30001msec); 0 zone resets
>     slat (usec): min=2, max=474, avg= 6.85, stdev= 2.73
>     clat (usec): min=187, max=2658, avg=255.05, stdev=81.21
>      lat (usec): min=192, max=2704, avg=261.90, stdev=81.75
>     clat percentiles (usec):
>      |  1.00th=[  200],  5.00th=[  204], 10.00th=[  210], 20.00th=[  219],
>      | 30.00th=[  225], 40.00th=[  231], 50.00th=[  235], 60.00th=[  239],
>      | 70.00th=[  245], 80.00th=[  253], 90.00th=[  293], 95.00th=[  502],
>      | 99.00th=[  570], 99.50th=[  611], 99.90th=[  963], 99.95th=[  988],
>      | 99.99th=[ 1401]
>    bw (  KiB/s): min=14704, max=16104, per=100.00%, avg=15234.44,
>      stdev=260.42, samples=59
>    iops        : min= 3676, max= 4026, avg=3808.61, stdev=65.10, samples=59
>   lat (usec)   : 250=76.75%, 500=18.23%, 750=4.86%, 1000=0.12%
>   lat (msec)   : 2=0.03%, 4=0.01%
>   cpu          : usr=3.12%, sys=2.55%, ctx=114166, majf=0, minf=184
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,114148,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   WRITE: bw=14.9MiB/s (15.6MB/s), 14.9MiB/s-14.9MiB/s (15.6MB/s-15.6MB/s),
>     io=446MiB (468MB), run=30001-30001msec
>
> Disk stats (read/write):
>     dm-2: ios=16/103, merge=0/0, ticks=1/24, in_queue=25, util=0.03%,
>     aggrios=31/426, aggrmerge=0/115, aggrticks=4/68, aggrin_queue=76,
>     aggrutil=0.30%
>   sda: ios=31/426, merge=0/115, ticks=4/68, in_queue=76, util=0.30%
>
>
> IMHO this is about as high as we can get right now on classic with
> everything stacked in our favor. Latency increases quickly once you
> involve remote network clients, multiple OSDs, replication, etc.
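As an aside, for anyone wanting to reproduce the "no c/p state
transitions, ceph-osd pinned to cores" part of such a setup: it usually
boils down to something like the lines below. This is only a sketch;
exact flags differ per distro and kernel, I don't know Mark's exact
invocations, the core list is just an example, and it assumes a single
ceph-osd process on the box.

$ sudo cpupower idle-set -D 0                 # disable all C-states deeper than poll
$ sudo cpupower frequency-set -g performance  # pin the frequency governor
$ sudo taskset -cp 0-7 $(pidof ceph-osd)      # pin the OSD to cores 0-7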
> Here's the same test with crimson using alienized bluestore:
>
>
> [nhm@o01 initial]$ sudo /home/nhm/src/fio/fio --ioengine=rbd --direct=1
> --bs=4096B --iodepth=1 --end_fsync=1 --rw=randwrite --norandommap
> --size=16384M --numjobs=1 --runtime=30 --time_based --clientname=admin
> --pool=cbt-librbd --rbdname=`hostname -f`-0 --invalidate=0
> --name=cbt-librbd/`hostname -f`-0-0
> cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (g=0): rw=randwrite,
> bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd,
> iodepth=1
>
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][w=22.0MiB/s][w=5886 IOPS][eta 00m:00s]
> cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (groupid=0, jobs=1): err= 0:
> pid=2075392: Sun Feb 7 12:44:32 2021
>   write: IOPS=5647, BW=22.1MiB/s (23.1MB/s)(662MiB/30001msec); 0 zone resets
>     slat (usec): min=2, max=527, avg= 2.93, stdev= 1.34
>     clat (usec): min=138, max=41367, avg=173.81, stdev=232.51
>      lat (usec): min=141, max=41370, avg=176.74, stdev=232.52
>     clat percentiles (usec):
>      |  1.00th=[  153],  5.00th=[  155], 10.00th=[  155], 20.00th=[  157],
>      | 30.00th=[  159], 40.00th=[  161], 50.00th=[  163], 60.00th=[  163],
>      | 70.00th=[  165], 80.00th=[  169], 90.00th=[  176], 95.00th=[  210],
>      | 99.00th=[  644], 99.50th=[  725], 99.90th=[  832], 99.95th=[  881],
>      | 99.99th=[ 1221]
>    bw (  KiB/s): min=19024, max=23792, per=100.00%, avg=22632.32,
>      stdev=964.77, samples=59
>    iops        : min= 4756, max= 5948, avg=5658.07, stdev=241.20, samples=59
>   lat (usec)   : 250=97.23%, 500=1.64%, 750=0.77%, 1000=0.33%
>   lat (msec)   : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
>   cpu          : usr=1.91%, sys=1.78%, ctx=169457, majf=1, minf=336
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,169444,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   WRITE: bw=22.1MiB/s (23.1MB/s), 22.1MiB/s-22.1MiB/s (23.1MB/s-23.1MB/s),
>     io=662MiB (694MB), run=30001-30001msec
>
> Disk stats (read/write):
>     dm-2: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
>     aggrios=1/308, aggrmerge=0/11, aggrticks=0/47, aggrin_queue=51,
>     aggrutil=0.27%
>   sda: ios=1/308, merge=0/11, ticks=0/47, in_queue=51, util=0.27%
>
>
> About 50% faster with lower latency, and we haven't even done much to
> optimize it yet. It's not as fast as your ZFS results below, but at
> least for a single OSD without replication I'm glad to see crimson is
> getting us close to being in the same ballpark. I will note that the
> crimson-osd process was using 23GB (!) in this test, so it's still very
> alpha code (and take the test result with a grain of salt since we're
> only doing minimal QA right now). At least we have a target to maintain
> and hopefully improve as we continue to work toward stabilizing it.
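The numbers check out: average total latency went from 261.90us
(classic) to 176.74us (crimson), and 1,000,000 / 261.9 ≈ 3800 while
1,000,000 / 176.7 ≈ 5650, which matches the IOPS fio reports; a ~48%
gain, presumably mostly from the OSD's internal IO path since both
runs are local, single-OSD and unreplicated.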
Crimson is still pretty far away and thus not something that can be
used today. Really good to see this, though! QD=1 performance is still
very important for many applications, and it also shapes how users
experience Ceph.

Wido

>
> Mark
>
>
> On 2/5/21 7:51 AM, Wido den Hollander wrote:
>> (Sending it to the dev list as people might know it there)
>>
>> Hi,
>>
>> There are many talks and presentations out there about Ceph's
>> performance. Ceph is great when it comes to parallel I/O, large queue
>> depths and many applications sending I/O towards Ceph.
>>
>> One thing where Ceph isn't the fastest is 4k blocks written at Queue
>> Depth 1.
>>
>> Some applications benefit very much from high-performance, low-latency
>> I/O at qd=1, for example single-threaded applications writing small
>> files inside a VM running on RBD.
>>
>> With some tuning you can get to a ~700us latency for a 4k write with
>> qd=1 (replication, size=3).
>>
>> I benchmark this using fio:
>>
>> $ fio --ioengine=rbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
>>
>> 700us latency means the result will be about ~1400 IOps (1000 / 0.7).
>>
>> Compared to, say, a BSD machine running ZFS, that's on the low side.
>> With ZFS+NVMe you'll be able to reach somewhere between 7,000 and
>> 10,000 IOps; the latency is simply much lower.
>>
>> My benchmarking / test setup for this:
>>
>> - Ceph Nautilus/Octopus (doesn't make a big difference)
>> - 3x SuperMicro 1U with:
>>   - AMD Epyc 7302P 16-core CPU
>>   - 128GB DDR4
>>   - 10x Samsung PM983 3.84TB
>>   - 10Gbit Base-T networking
>>
>> Things to configure/tune:
>>
>> - C-State pinning to 1
>> - CPU governor set to performance
>> - Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
>>
>> Higher clock speeds (new AMD Epyc coming in March!) help to reduce the
>> latency, and going towards 25Gbit/100Gbit might help as well.
>>
>> These are, however, only small increments that might reduce the
>> latency by another 15% or so.
>>
>> It doesn't bring us anywhere near the 10k IOps other applications can do.
>>
>> And I totally understand that replication over a TCP/IP network takes
>> time and thus increases latency.
>>
>> The Crimson project [0] is aiming to lower the latency with things
>> like DPDK and SPDK, but it is far from finished and production-ready.
>>
>> In the meantime, am I overlooking something here? Can we reduce the
>> latency of the current OSDs any further?
>>
>> Reaching a ~500us latency would already be great!
>>
>> Thanks,
>>
>> Wido
>>
>>
>> [0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
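PS: for completeness, the "turn off all logging" item in my tuning list
above translates to something like this in ceph.conf. A sketch: the 0/0
form sets both the log level and the in-memory gather level to 0, and
there are more debug_* subsystems you can zero out the same way.

[osd]
debug_osd = 0/0
debug_ms = 0/0
debug_bluestore = 0/0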