Hi Wido,
Long time no see! Yeah, this kind of use case is painful for ceph. It
really hurts having to send off synchronous replica writes to other OSDs
and having to wait to ack the client after *all* have completed. We're
only as fast as the slowest replica write for each IO, and we can't
parallelize things at all since it's literally a single client doing a
single IO at a time. I don't doubt your ZFS results; they sound about
right given the difference in the IO path. FWIW, I've been doing quite
a bit of work lately poking at crimson with the "alienized" bluestore
implementation using fio+librbd against a single NVMe backed OSD running
on localhost. We're not as fast as the classic OSD for large parallel
workloads because we only have a single reactor thread right now. We'll
need multiple reactors before we can match classic at high queue
depths. At lower queue depths crimson is actually faster though,
despite not even using a seastar native objectstore yet.
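To make the QD1 math a little more concrete, here's a rough
back-of-the-envelope model (plain Python, not the actual Ceph IO path;
all the numbers are made-up placeholders): with one IO in flight, IOPS
is just the reciprocal of per-op latency, and a replicated write can't
ack until the slowest replica has committed.

# Rough QD1 latency sketch -- illustrative only, not how the OSD is coded.

def qd1_iops(per_op_latency_us: float) -> float:
    """With a single outstanding IO, throughput is simply 1 / latency."""
    return 1_000_000 / per_op_latency_us

def replicated_write_latency_us(rtt_client_osd_us: float,
                                primary_commit_us: float,
                                replica_commit_us: list[float],
                                rtt_osd_osd_us: float) -> float:
    """The primary can only ack the client after *all* replicas have
    committed, so the slowest replica sets the pace."""
    replica_paths = [rtt_osd_osd_us + c for c in replica_commit_us]
    return rtt_client_osd_us + max([primary_commit_us] + replica_paths)

# Hypothetical numbers: 100us client<->OSD RTT, 150us primary commit,
# two replicas committing in 150us and 400us, 100us OSD<->OSD RTT.
lat = replicated_write_latency_us(100, 150, [150, 400], 100)
print(lat, qd1_iops(lat))   # 600us -> ~1666 IOPS, gated by the slow replica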
For fun I just did a quick 30s QD1 test on our newer test hardware that
Intel donated for the community lab (Xeon Platinum CPUs, P4510 NVMe
drives, etc.). This is "almost master" with a couple of additional
crimson PRs, using a single 16GB pre-allocated RBD volume and all of the
system-level optimizations you can imagine (no C/P-state transitions,
ceph-osd pinned to a specific set of cores, fio on localhost, no
replication, etc.), so it's pretty much a best case in terms of
squeezing out performance. First, here's the classic OSD with bluestore
on master:
[nhm@o01 initial]$ sudo /home/nhm/src/fio/fio --ioengine=rbd --direct=1
--bs=4096B --iodepth=1 --end_fsync=1 --rw=randwrite --norandommap
--size=16384M --numjobs=1 --runtime=30 --time_based --clientname=admin
--pool=cbt-librbd --rbdname=`hostname -f`-0 --invalidate=0
--name=cbt-librbd/`hostname -f`-0-0
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R)
4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=14.5MiB/s][w=3723 IOPS][eta 00m:00s]
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (groupid=0, jobs=1): err= 0:
pid=2085951: Sun Feb 7 12:55:19 2021
write: IOPS=3804, BW=14.9MiB/s (15.6MB/s)(446MiB/30001msec); 0 zone
resets
slat (usec): min=2, max=474, avg= 6.85, stdev= 2.73
clat (usec): min=187, max=2658, avg=255.05, stdev=81.21
lat (usec): min=192, max=2704, avg=261.90, stdev=81.75
clat percentiles (usec):
| 1.00th=[ 200], 5.00th=[ 204], 10.00th=[ 210], 20.00th=[ 219],
| 30.00th=[ 225], 40.00th=[ 231], 50.00th=[ 235], 60.00th=[ 239],
| 70.00th=[ 245], 80.00th=[ 253], 90.00th=[ 293], 95.00th=[ 502],
| 99.00th=[ 570], 99.50th=[ 611], 99.90th=[ 963], 99.95th=[ 988],
| 99.99th=[ 1401]
bw ( KiB/s): min=14704, max=16104, per=100.00%, avg=15234.44,
stdev=260.42, samples=59
iops : min= 3676, max= 4026, avg=3808.61, stdev=65.10, samples=59
lat (usec) : 250=76.75%, 500=18.23%, 750=4.86%, 1000=0.12%
lat (msec) : 2=0.03%, 4=0.01%
cpu : usr=3.12%, sys=2.55%, ctx=114166, majf=0, minf=184
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwts: total=0,114148,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=14.9MiB/s (15.6MB/s), 14.9MiB/s-14.9MiB/s
(15.6MB/s-15.6MB/s), io=446MiB (468MB), run=30001-30001msec
Disk stats (read/write):
dm-2: ios=16/103, merge=0/0, ticks=1/24, in_queue=25, util=0.03%,
aggrios=31/426, aggrmerge=0/115, aggrticks=4/68, aggrin_queue=76,
aggrutil=0.30%
sda: ios=31/426, merge=0/115, ticks=4/68, in_queue=76, util=0.30%
IMHO this is about as high as we can get right now on classic with
everything stacked in our favor. Latency increases quickly once you
involve remote network clients, multiple OSDs, replication, etc.
Here's the same test with crimson using alienized bluestore:
[nhm@o01 initial]$ sudo /home/nhm/src/fio/fio --ioengine=rbd --direct=1
--bs=4096B --iodepth=1 --end_fsync=1 --rw=randwrite --norandommap
--size=16384M --numjobs=1 --runtime=30 --time_based --clientname=admin
--pool=cbt-librbd --rbdname=`hostname -f`-0 --invalidate=0
--name=cbt-librbd/`hostname -f`-0-0
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R)
4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=22.0MiB/s][w=5886 IOPS][eta 00m:00s]
cbt-librbd/o01.vlan104.sepia.ceph.com-0-0: (groupid=0, jobs=1): err= 0:
pid=2075392: Sun Feb 7 12:44:32 2021
write: IOPS=5647, BW=22.1MiB/s (23.1MB/s)(662MiB/30001msec); 0 zone
resets
slat (usec): min=2, max=527, avg= 2.93, stdev= 1.34
clat (usec): min=138, max=41367, avg=173.81, stdev=232.51
lat (usec): min=141, max=41370, avg=176.74, stdev=232.52
clat percentiles (usec):
| 1.00th=[ 153], 5.00th=[ 155], 10.00th=[ 155], 20.00th=[ 157],
| 30.00th=[ 159], 40.00th=[ 161], 50.00th=[ 163], 60.00th=[ 163],
| 70.00th=[ 165], 80.00th=[ 169], 90.00th=[ 176], 95.00th=[ 210],
| 99.00th=[ 644], 99.50th=[ 725], 99.90th=[ 832], 99.95th=[ 881],
| 99.99th=[ 1221]
bw ( KiB/s): min=19024, max=23792, per=100.00%, avg=22632.32,
stdev=964.77, samples=59
iops : min= 4756, max= 5948, avg=5658.07, stdev=241.20,
samples=59
lat (usec) : 250=97.23%, 500=1.64%, 750=0.77%, 1000=0.33%
lat (msec) : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=1.91%, sys=1.78%, ctx=169457, majf=1, minf=336
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued rwts: total=0,169444,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=22.1MiB/s (23.1MB/s), 22.1MiB/s-22.1MiB/s
(23.1MB/s-23.1MB/s), io=662MiB (694MB), run=30001-30001msec
Disk stats (read/write):
dm-2: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=1/308, aggrmerge=0/11, aggrticks=0/47, aggrin_queue=51,
aggrutil=0.27%
sda: ios=1/308, merge=0/11, ticks=0/47, in_queue=51, util=0.27%
About 50% faster with lower latency, and we haven't even done much to
optimize it yet. It's not as fast as your ZFS results below, but at
least for a single OSD without replication I'm glad to see crimson is
getting us close to being in the same ballpark. I will note that the
crimson-osd process was using 23GB (!) of memory in this test, so it's
still very alpha code (and take the test result with a grain of salt
since we're only doing minimal QA right now). At least we have a target
to maintain, and hopefully improve, as we continue to work toward
stabilizing it.
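For what it's worth, the "about 50%" is just straight arithmetic on the
two fio runs above (throwaway snippet, numbers copied from the output):

classic_iops, classic_clat_us = 3804, 255.05   # classic OSD run above
crimson_iops, crimson_clat_us = 5647, 173.81   # crimson run above

print(f"IOPS gain:     {crimson_iops / classic_iops - 1:.0%}")         # ~48%
print(f"avg clat drop: {1 - crimson_clat_us / classic_clat_us:.0%}")   # ~32%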
Mark
On 2/5/21 7:51 AM, Wido den Hollander wrote:
(Sending it to the dev list, as people there might know)
Hi,
There are many talks and presentations out there about Ceph's
performance. Ceph is great when it comes to parallel I/O, large queue
depths and many applications sending I/O towards Ceph.
One thing where Ceph isn't the fastest is 4k blocks written at Queue
Depth 1.
Some applications benefit very much from high-performance/low-latency
I/O at qd=1, for example single-threaded applications writing small
files inside a VM running on RBD.
With some tuning you can get to a ~700us latency for a 4k write with
qd=1 (replication, size=3).
I benchmark this using fio:
$ fio --ioengine=rbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
A 700us latency means the result will be roughly 1400 IOPS (1000 / 0.7 ≈ 1430).
Compared to, let's say, a BSD machine running ZFS, that's on the low
side. With ZFS+NVMe you'll be able to reach somewhere between 7,000 and
10,000 IOPS; the latency is simply much lower.
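As a quick sanity check of that arithmetic (throwaway Python, nothing
Ceph-specific), at qd=1 IOPS and per-write latency are just reciprocals:

# qd=1: IOPS and per-op latency are reciprocals of each other.
def iops_from_latency_us(lat_us: float) -> float:
    return 1_000_000 / lat_us

def latency_us_from_iops(iops: float) -> float:
    return 1_000_000 / iops

print(iops_from_latency_us(700))     # ~1430 IOPS at 700us
print(latency_us_from_iops(7000))    # ~143us needed for 7,000 IOPS
print(latency_us_from_iops(10000))   # 100us needed for 10,000 IOPS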
My benchmarking / test setup for this:
- Ceph Nautilus/Octopus (doesn't make a big difference)
- 3x SuperMicro 1U with:
- AMD Epyc 7302P 16-core CPU
- 128GB DDR4
- 10x Samsung PM983 3.84TB
- 10Gbit Base-T networking
Things to configure/tune:
- C-State pinning to 1
- CPU governor set to performance
- Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
Higher clock speeds (new AMD Epyc coming in March!) help to reduce the
latency, and going towards 25Gbit/100Gbit networking might help as well.
These are, however, only very small increments and might reduce the
latency by another 15% or so.
It doesn't bring us anywhere near the 10k IOPS other applications can do.
And I totally understand that replication over a TCP/IP network takes
time and thus increases latency.
The Crimson project [0] aims to lower the latency with things like DPDK
and SPDK, but it is far from finished and production-ready.
In the meantime, am I overlooking something here? Can we reduce the
latency of the current OSDs further?
Reaching a ~500us latency would already be great!
Thanks,
Wido
[0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx