On 9/23/20 10:58 AM, vitalif@xxxxxxxxxx wrote:
> I have no idea how you get 66k write iops with one OSD )
> I've just repeated a test by creating a test pool on one NVMe OSD with 8 PGs (all pinned to the same OSD with pg-upmap). Then I ran 4x fio randwrite q128 over 4 RBD images. I got 17k iops.
> OK, in fact that's not the worst result for Ceph, but the problem is that I only get 30k write iops when benchmarking 4 RBD images spread over all OSDs _in_the_same_cluster_. And there are 14 of them.
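For anyone following along, the setup being described above is roughly the following sketch, assuming a 1x (unreplicated) pool; the pool/image names, pg IDs, and OSD targets here are placeholders, and the exact pg-upmap-items mappings depend on where the 8 PGs initially land:

  # single-OSD test pool with 8 PGs, size 1 (no replication)
  ceph osd pool create testpool 8 8
  ceph osd pool set testpool size 1
  ceph osd pool set testpool min_size 1

  # pin every PG to the same OSD, e.g. move PG 1.0 from osd.3 to osd.0
  # (repeat for each of the 8 PGs, using its current acting OSD as the source)
  ceph osd pg-upmap-items 1.0 3 0

  # one RBD image per fio client (size in MB)
  rbd create testpool/img0 --size 131072

  # 4K randwrite at iodepth=128; run one of these per image
  fio --name=img0 --ioengine=rbd --clientname=admin --pool=testpool \
      --rbdname=img0 --rw=randwrite --bs=4k --iodepth=128 \
      --time_based --runtime=300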
The key parts are the single OSD, the tuning, the fast HW, the massive
aggregate IO depth, and the fact that it's a fresh OSD with only this
dataset sitting on it. Like I said, you can't get that when you go to a
multi-node setup with replication. The same hardware only achieves
about 19K IOPS per OSD before replication in an 8-node, 64 NVMe, 3x
replication configuration. Feel free to look at the data from the run
yourself though (2.5MB tgz file):
https://drive.google.com/file/d/1SuFoJtxAauKTBjgrV9XB00_M3Yr9LwW1/view?usp=sharing
In the results directory you'll see four separate copies of fio, each
doing 4K random writes to its own 128GB dataset on its own RBD volume
via the librbd engine with iodepth=128:
[nhm@o10 id-f7e4b44f]$ grep "randwrite," output.*.localhost
output.0.localhost:cbt-librbd/o10.front.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.1.localhost:cbt-librbd/o10.front.sepia.ceph.com-1-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.2.localhost:cbt-librbd/o10.front.sepia.ceph.com-2-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.3.localhost:cbt-librbd/o10.front.sepia.ceph.com-3-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
[nhm@o10 id-f7e4b44f]$ grep rbdname output.*.localhost
output.0.localhost: "rbdname" : "o10.front.sepia.ceph.com-0",
output.1.localhost: "rbdname" : "o10.front.sepia.ceph.com-1",
output.2.localhost: "rbdname" : "o10.front.sepia.ceph.com-2",
output.3.localhost: "rbdname" : "o10.front.sepia.ceph.com-3",
[nhm@o10 id-f7e4b44f]$ grep size output.*.localhost
output.0.localhost: "size" : "131072M",
output.1.localhost: "size" : "131072M",
output.2.localhost: "size" : "131072M",
output.3.localhost: "size" : "131072M",
[nhm@o10 id-f7e4b44f]$ grep IOPS output.*.localhost
output.0.localhost: write: IOPS=15.5k, BW=60.5MiB/s (63.4MB/s)(17.7GiB/300003msec); 0 zone resets
output.1.localhost: write: IOPS=15.7k, BW=61.2MiB/s (64.2MB/s)(17.9GiB/300005msec); 0 zone resets
output.2.localhost: write: IOPS=15.6k, BW=60.9MiB/s (63.9MB/s)(17.8GiB/300005msec); 0 zone resets
output.3.localhost: write: IOPS=19.1k, BW=74.5MiB/s (78.2MB/s)(21.8GiB/300006msec); 0 zone resets
The timestamps are virtually identical:
[nhm@o10 id-f7e4b44f]$ grep timestamp_ms output.*.localhost
output.0.localhost: "timestamp_ms" : 1570222502962,
output.1.localhost: "timestamp_ms" : 1570222502965,
output.2.localhost: "timestamp_ms" : 1570222502965,
output.3.localhost: "timestamp_ms" : 1570222502959,
as is the job runtime:
output.0.localhost: "job_runtime" : 300002,
output.1.localhost: "job_runtime" : 300004,
output.2.localhost: "job_runtime" : 300004,
output.3.localhost: "job_runtime" : 300005,
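Since the four jobs start and stop at virtually the same time, you can simply sum their IOPS to get the aggregate, which is where the ~66K number comes from:

  $ echo "15.5 + 15.7 + 15.6 + 19.1" | bc
  65.9

i.e. roughly 66K client-side 4K random write IOPS against a single OSD.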
We can perform some sanity checks by looking at collectl data from the run:
[nhm@o10 id-f7e4b44f]$ collectl -sD -oT -p collectl.localhost/o10-20191004-204957.raw.gz | grep -E "nvme0n1p1|#" | tail -n 21
# DISK STATISTICS (/sec)
#                    <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
20:54:45 nvme0n1p1        0      0    0    0     0  227780  55001 1944  117     1     117     7     1      0   39
20:54:46 nvme0n1p1        0      0    0    0     0  485056 117319 3945  123     4     122    38     4      0   45
20:54:47 nvme0n1p1        0      0    0    0     0  409144  98947 3339  123     7     122    52     7      0   42
20:54:48 nvme0n1p1        0      0    0    0     0  384940  93056 3179  121     2     121    14     2      0   51
20:54:49 nvme0n1p1        0      0    0    0     0  499152 120724 4064  123     4     122    28     4      0   55
20:54:50 nvme0n1p1        0      0    0    0     0  388560  93931 3209  121     4     121    26     4      0   52
20:54:51 nvme0n1p1        0      0    0    0     0  316304  76415 2661  119     5     118    26     5      0   51
20:54:52 nvme0n1p1        0      0    0    0     0  418876 107305 3443  122     5     121    30     5      0   52
20:54:53 nvme0n1p1        0      0    0    0     0  367420  82728 3098  119     4     118    26     4      0   50
20:54:54 nvme0n1p1        0      0    0    0     0  293444  70912 2449  120     5     119    33     5      0   41
20:54:55 nvme0n1p1        0      0    0    0     0  300680  72687 2483  121     6     121    39     6      0   38
20:54:56 nvme0n1p1        0      0    0    0     0  274404  66311 2290  120     2     119    10     2      0   40
20:54:57 nvme0n1p1        0      0    0    0     0  244220  58996 2059  119     5     118    28     5      0   39
20:54:58 nvme0n1p1        0      0    0    0     0  242116  58491 2038  119     6     118    31     6      0   37
20:54:59 nvme0n1p1        0      0    0    0     0  436208 105512 3540  123     5     123    40     5      0   44
20:55:00 nvme0n1p1        0      0    0    0     0  427420 103459 3474  123     7     123    54     7      0   42
20:55:01 nvme0n1p1        0      0    0    0     0  152428  36677 1352  113     0     112     1     0      0   36
20:55:02 nvme0n1p1        0      0    0    0     0  448740 108542 3643  123     5     123    45     5      0   44
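As a rough gauge of how much merging is happening, the write Merged and IOs columns can be summed to get the request rate hitting the block layer before merging (a sketch; the field numbers assume the column layout shown above, and note this traffic includes WAL and compaction writes, not just the client 4K writes):

  collectl -sD -oT -p collectl.localhost/o10-20191004-204957.raw.gz \
      | grep nvme0n1p1 \
      | awk '{ print $1, ($9 + $10), "write reqs/s before merging" }'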
So in this case the block layer in the kernel was able to merge a bunch
of IOs, helping us get that nice juicy benchmark number (though these
P4510 NVMe drives may have been able to handle the more randomly
scattered, non-merged 4K writes relatively well too). We also have a
number of IOs waiting in the device queue (QLen), meaning we are pushing
the NVMe at least moderately hard. Next we look at OSD process data
over time:
[nhm@o10 id-f7e4b44f]$ collectl -sZ -oT -p collectl.localhost/o10-20191004-204957.raw.gz | awk 'NR==5 || /ceph-osd/' | head -n 10
#Time      PID  User     PR  PPID THRD S   VSZ   RSS CP  SysT   UsrT  Pct AccuTime  RKB  WKB MajF MinF Command
20:50:08  7964  root     20     1   71 S    7G    6G 84 10.13  49.34  594 09:03.66    0    0    0  75K /usr/local/bin/ceph-osd
20:50:18  7964  root     20     1   71 S    7G    6G 84 15.26  84.28  995 10:43.21    0    0    0  42K /usr/local/bin/ceph-osd
20:50:28  7964  root     20     1   71 S    7G    6G 84 15.87  89.10 1049 12:28.18    0    0    0  59K /usr/local/bin/ceph-osd
20:50:38  7964  root     20     1   71 S    7G    6G 84 15.46  94.56 1100 14:18.20    0    0    0  66K /usr/local/bin/ceph-osd
20:50:48  7964  root     20     1   71 S    7G    6G 84 15.19  94.82 1100 16:08.21    0    0    0  76K /usr/local/bin/ceph-osd
20:50:58  7964  root     20     1   71 S    7G    6G 84 14.64  92.22 1068 17:55.07    0    0    0  52K /usr/local/bin/ceph-osd
20:51:08  7964  root     20     1   71 S    7G    6G 84 14.91  95.36 1102 19:45.34    0    0    0  55K /usr/local/bin/ceph-osd
20:51:18  7964  root     20     1   71 S    7G    6G 84 15.48 100.92 1164 21:41.74    0    0    0  64K /usr/local/bin/ceph-osd
20:51:28  7964  root     20     1   71 S    7G    6G 84 14.91  90.63 1055 23:27.28    0    0    0  53K /usr/local/bin/ceph-osd
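The Pct column is total CPU percent for the process across all of its threads, so dividing by 100 gives an approximate core count over time (again a sketch; the field number assumes the column layout shown above):

  collectl -sZ -oT -p collectl.localhost/o10-20191004-204957.raw.gz \
      | awk '/ceph-osd/ { print $1, $13 / 100, "cores" }'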
The OSD process is consuming a little under 8GB of memory, which is
reasonably on target. The CPU used by the OSD stabilizes at roughly
10-12 cores. The important point is that the OSD/bluestore machinery is
perfectly happy consuming a ton of cores and driving ~66K IOPS worth of
work at the underlying device, assuming the block layer and the device
itself can handle it. It used to be that the kv sync and onode cache
trimming threads were much bigger bottlenecks, but we've improved both
over the last couple of years. Having said that, we're getting pretty
close to the current limits of bluestore here. Also, once you start
involving the network and replication, we introduce enough extra latency
and contention in our stack that the OSD can no longer maintain these
kinds of numbers even if the device itself and bluestore can handle it.
That's where seastore, crimson, and future msgr work come in IMHO.
Finally, we can use cbt's OSD log parser to look at rocksdb compaction
statistics. This also probably explains at least some of the large IO
workload we see in collectl, since we are spending nearly 1/3rd of the
overall OSD uptime in compaction (and given that the test only ran for
300s of that uptime, rocksdb may be spending the majority of the actual
test window performing universal compaction):
[nhm@o10 nvme_nvme_bs]$ ~/src/ceph/cbt/tools/ceph_rocksdb_log_parser.py ./osd.0.log | head -n 13
Compaction Statistics ./osd.0.log
Total OSD Log Duration (seconds) 605.512
Number of Compaction Events 112
Avg Compaction Time (seconds) 1.68255873214
Total Compaction Time (seconds) 188.446578
Avg Output Size: (MB) 380.801597553
Total Output Size: (MB) 42649.7789259
Total Input Records 243648735
Total Output Records 174897784
Avg Output Throughput (MB/s) 250.561577229
Avg Input Records/second 1242038.3797
Avg Output Records/second 1059967.66123
Avg Output/Input Ratio 0.862494228971
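The "nearly 1/3rd of the OSD uptime in compaction" figure above falls out of those numbers directly:

  $ echo "scale=3; 188.446578 / 605.512" | bc
  .311

i.e. about 31% of the 605s log window was spent compacting.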
Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx