On 9/23/20 12:18 PM, Mark Nelson wrote:
On 9/23/20 10:58 AM, vitalif@xxxxxxxxxx wrote:
I have no idea how you get 66k write iops with one OSD )
I've just repeated a test by creating a test pool on one NVMe OSD
with 8 PGs (all pinned to the same OSD with pg-upmap). Then I ran 4x
fio randwrite q128 over 4 RBD images. I got 17k iops.
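For reference, a single-OSD test along those lines looks roughly like this (a sketch only; the pool/image names are illustrative and the pg-upmap pinning is repeated once per PG):

ceph osd pool create testpool 8 8
# pin every PG of the pool to one OSD, e.g. osd.0 (repeat per PG id)
ceph osd pg-upmap 1.0 0
# create one of the 4 test images (100GB; rbd sizes are in MB by default)
rbd create testpool/img0 --size 102400
fio --ioengine=rbd --clientname=admin --pool=testpool --rbdname=img0 \
    --rw=randwrite --bs=4k --iodepth=128 --numjobs=1 --time_based \
    --runtime=300 --name=img0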
OK, in fact that's not the worst result for Ceph, but the problem is that
I only get 30k write iops when benchmarking 4 RBD images spread over
all OSDs _in_the_same_cluster_. And there are 14 of them.
The key parts are the single OSD, the tuning, the fast HW, the massive
aggregate io depth, and the fact that it's a fresh OSD with only this
dataset sitting on it. Like I said, you can't get that when you go to
a multi-node setup with replication. The same hardware is only
achieving about 19K IOPS per OSD before replication in an 8 node, 64
NVMe, 3x rep configuration. Feel free to look at the data from the
run yourself though (2.5MB tgz file):
https://drive.google.com/file/d/1SuFoJtxAauKTBjgrV9XB00_M3Yr9LwW1/view?usp=sharing
In the results directory you'll see the 4 separate copies of fio doing
4k random writes, each to its own 128GB dataset on its own RBD volume
via the librbd engine with iodepth=128:
[nhm@o10 id-f7e4b44f]$ grep "randwrite," output.*.localhost
output.0.localhost:cbt-librbd/o10.front.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.1.localhost:cbt-librbd/o10.front.sepia.ceph.com-1-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.2.localhost:cbt-librbd/o10.front.sepia.ceph.com-2-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.3.localhost:cbt-librbd/o10.front.sepia.ceph.com-3-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
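Each of those corresponds to a job roughly like the following (a sketch rather than the exact CBT-generated job file; the pool and client names are assumptions):

[o10.front.sepia.ceph.com-0-0]
ioengine=rbd
clientname=admin
pool=cbt-librbd
rbdname=o10.front.sepia.ceph.com-0
rw=randwrite
bs=4k
iodepth=128
size=131072M
time_based
runtime=300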
[nhm@o10 id-f7e4b44f]$ grep rbdname output.*.localhost
output.0.localhost: "rbdname" : "o10.front.sepia.ceph.com-0",
output.1.localhost: "rbdname" : "o10.front.sepia.ceph.com-1",
output.2.localhost: "rbdname" : "o10.front.sepia.ceph.com-2",
output.3.localhost: "rbdname" : "o10.front.sepia.ceph.com-3",
[nhm@o10 id-f7e4b44f]$ grep size output.*.localhost
output.0.localhost: "size" : "131072M",
output.1.localhost: "size" : "131072M",
output.2.localhost: "size" : "131072M",
output.3.localhost: "size" : "131072M",
[nhm@o10 id-f7e4b44f]$ grep IOPS output.*.localhost
output.0.localhost: write: IOPS=15.5k, BW=60.5MiB/s (63.4MB/s)(17.7GiB/300003msec); 0 zone resets
output.1.localhost: write: IOPS=15.7k, BW=61.2MiB/s (64.2MB/s)(17.9GiB/300005msec); 0 zone resets
output.2.localhost: write: IOPS=15.6k, BW=60.9MiB/s (63.9MB/s)(17.8GiB/300005msec); 0 zone resets
output.3.localhost: write: IOPS=19.1k, BW=74.5MiB/s (78.2MB/s)(21.8GiB/300006msec); 0 zone resets
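That works out to roughly 15.5K + 15.7K + 15.6K + 19.1K ≈ 65.9K aggregate write IOPS across the four clients, i.e. the ~66K figure discussed above.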
The timestamps are virtually identical:
[nhm@o10 id-f7e4b44f]$ grep timestamp_ms output.*.localhost
output.0.localhost: "timestamp_ms" : 1570222502962,
output.1.localhost: "timestamp_ms" : 1570222502965,
output.2.localhost: "timestamp_ms" : 1570222502965,
output.3.localhost: "timestamp_ms" : 1570222502959,
as is the job runtime:
output.0.localhost: "job_runtime" : 300002,
output.1.localhost: "job_runtime" : 300004,
output.2.localhost: "job_runtime" : 300004,
output.3.localhost: "job_runtime" : 300005,
We can perform some sanity checks by looking at collectl data from the
run:
[nhm@o10 id-f7e4b44f]$ collectl -sD -oT -p collectl.localhost/o10-20191004-204957.raw.gz | grep -E "nvme0n1p1|#" | tail -n 21
# DISK STATISTICS (/sec)
#                    <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
20:54:45 nvme0n1p1        0      0    0    0     0  227780  55001 1944  117     1     117     7     1      0   39
20:54:46 nvme0n1p1        0      0    0    0     0  485056 117319 3945  123     4     122    38     4      0   45
20:54:47 nvme0n1p1        0      0    0    0     0  409144  98947 3339  123     7     122    52     7      0   42
20:54:48 nvme0n1p1        0      0    0    0     0  384940  93056 3179  121     2     121    14     2      0   51
20:54:49 nvme0n1p1        0      0    0    0     0  499152 120724 4064  123     4     122    28     4      0   55
20:54:50 nvme0n1p1        0      0    0    0     0  388560  93931 3209  121     4     121    26     4      0   52
20:54:51 nvme0n1p1        0      0    0    0     0  316304  76415 2661  119     5     118    26     5      0   51
20:54:52 nvme0n1p1        0      0    0    0     0  418876 107305 3443  122     5     121    30     5      0   52
20:54:53 nvme0n1p1        0      0    0    0     0  367420  82728 3098  119     4     118    26     4      0   50
20:54:54 nvme0n1p1        0      0    0    0     0  293444  70912 2449  120     5     119    33     5      0   41
20:54:55 nvme0n1p1        0      0    0    0     0  300680  72687 2483  121     6     121    39     6      0   38
20:54:56 nvme0n1p1        0      0    0    0     0  274404  66311 2290  120     2     119    10     2      0   40
20:54:57 nvme0n1p1        0      0    0    0     0  244220  58996 2059  119     5     118    28     5      0   39
20:54:58 nvme0n1p1        0      0    0    0     0  242116  58491 2038  119     6     118    31     6      0   37
20:54:59 nvme0n1p1        0      0    0    0     0  436208 105512 3540  123     5     123    40     5      0   44
20:55:00 nvme0n1p1        0      0    0    0     0  427420 103459 3474  123     7     123    54     7      0   42
20:55:01 nvme0n1p1        0      0    0    0     0  152428  36677 1352  113     0     112     1     0      0   36
20:55:02 nvme0n1p1        0      0    0    0     0  448740 108542 3643  123     5     123    45     5      0   44
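If collectl isn't handy, the same write-merge and queue-depth behavior can be cross-checked with iostat's extended statistics (a rough sketch, assuming the sysstat package is installed; column names vary a bit between versions):

# wrqm/s shows merged write requests, aqu-sz/avgqu-sz the average queue depth
iostat -dxm -p nvme0n1 1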
So in this case the block layer in the kernel was able to play some
tricks and merge a bunch of IOs, helping us get that nice juicy
benchmark number (though these P4510 NVMe drives may have been able to
handle more randomly scattered, non-merged 4K writes relatively well
too). We also have a number of IOs waiting in the device queue,
meaning we are pushing the NVMe at least moderately hard. Next we
look at OSD process data over time:
[nhm@o10 id-f7e4b44f]$ collectl -sZ -oT -p collectl.localhost/o10-20191004-204957.raw.gz | awk 'NR==5 || /ceph-osd/' | head -n 10
#Time      PID  User  PR  PPID THRD S  VSZ  RSS CP  SysT   UsrT  Pct  AccuTime  RKB  WKB MajF MinF Command
20:50:08  7964  root  20     1   71 S   7G   6G 84 10.13  49.34  594  09:03.66    0    0    0  75K /usr/local/bin/ceph-osd
20:50:18  7964  root  20     1   71 S   7G   6G 84 15.26  84.28  995  10:43.21    0    0    0  42K /usr/local/bin/ceph-osd
20:50:28  7964  root  20     1   71 S   7G   6G 84 15.87  89.10 1049  12:28.18    0    0    0  59K /usr/local/bin/ceph-osd
20:50:38  7964  root  20     1   71 S   7G   6G 84 15.46  94.56 1100  14:18.20    0    0    0  66K /usr/local/bin/ceph-osd
20:50:48  7964  root  20     1   71 S   7G   6G 84 15.19  94.82 1100  16:08.21    0    0    0  76K /usr/local/bin/ceph-osd
20:50:58  7964  root  20     1   71 S   7G   6G 84 14.64  92.22 1068  17:55.07    0    0    0  52K /usr/local/bin/ceph-osd
20:51:08  7964  root  20     1   71 S   7G   6G 84 14.91  95.36 1102  19:45.34    0    0    0  55K /usr/local/bin/ceph-osd
20:51:18  7964  root  20     1   71 S   7G   6G 84 15.48 100.92 1164  21:41.74    0    0    0  64K /usr/local/bin/ceph-osd
20:51:28  7964  root  20     1   71 S   7G   6G 84 14.91  90.63 1055  23:27.28    0    0    0  53K /usr/local/bin/ceph-osd
The OSD process is consuming a little under 8GB, which is reasonably on
target. CPU usage (the Pct column) stabilizes at roughly 10-12 cores.
The important point is that the OSD/bluestore machinery is perfectly
happy consuming a ton of cores and driving ~66K IOPS worth of work at
the underlying device, assuming the block layer and the device itself
can handle it. It used to be that the kv sync and onode cache trimming
threads were much bigger bottlenecks, but we've improved both over the
last couple of years. Having said that, we're getting pretty close to
the current limits of bluestore here. Also, once you start involving
the network and replication, we introduce enough extra latency and
contention in our stack that the OSD can no longer maintain these
kinds of numbers even if the device itself and bluestore can handle
it. That's where seastore, crimson, and future msgr work come in
imho. Finally, we can use cbt's OSD log parser to look at rocksdb
compaction statistics. This also probably explains at least some of
the large IO workload we see in collectl, since we are spending nearly
a third of the overall OSD uptime in compaction (and given that the test
only ran for 300s of that uptime, rocksdb may be spending the majority
of the actual test portion of the OSD uptime performing universal
compaction):
[nhm@o10 nvme_nvme_bs]$ ~/src/ceph/cbt/tools/ceph_rocksdb_log_parser.py ./osd.0.log | head -n 13
Compaction Statistics ./osd.0.log
Total OSD Log Duration (seconds) 605.512
Number of Compaction Events 112
Avg Compaction Time (seconds) 1.68255873214
Total Compaction Time (seconds) 188.446578
Avg Output Size: (MB) 380.801597553
Total Output Size: (MB) 42649.7789259
Total Input Records 243648735
Total Output Records 174897784
Avg Output Throughput (MB/s) 250.561577229
Avg Input Records/second 1242038.3797
Avg Output Records/second 1059967.66123
Avg Output/Input Ratio 0.862494228971
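If you just want a live view rather than parsing the OSD log, the OSD admin socket exposes rocksdb and bluestore perf counters as well (a sketch; section and counter names vary by release, and jq is only there for readability):

ceph daemon osd.0 perf dump | jq '{rocksdb: .rocksdb, bluestore: .bluestore}'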
Mark
Minor correction: nvme0n1p1 was a DB/WAL partition, while nvme1n1p2 is
the block storage device where the 4k writes are happening, unmerged.
That makes way more sense, since those merged writes are all just the
WAL and big SST writes. Sorry, it's been a while since I looked at
these numbers. I.e., here we are using a separate NVMe drive for the
DB/WAL. In reality I don't recall it actually helping that much vs
just putting everything on a single P4510, but in any event I wanted
to update for clarity.
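For context, an OSD laid out this way is typically created along these lines (a sketch; the device paths are illustrative):

ceph-volume lvm create --bluestore --data /dev/nvme1n1p2 --block.db /dev/nvme0n1p1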
Actual block partition writes:
[nhm@o10 id-f7e4b44f]$ collectl -sD -oT -p collectl.localhost/o10-20191004-204957.raw.gz | grep -E "nvme1n1p2|#" | tail -n 21
# DISK STATISTICS (/sec)
#                    <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
20:54:45 nvme1n1p2        0      0    0    0     0  261804      0  65K    4     0       3     0     0      0   89
20:54:46 nvme1n1p2        0      0    0    0     0  206916      0  51K    4     0       4     0     0      0   67
20:54:47 nvme1n1p2        0      0    0    0     0  218888      0  54K    4     0       4     0     0      0   75
20:54:48 nvme1n1p2        0      0    0    0     0  261032      0  65K    4     0       4     0     0      0   79
20:54:49 nvme1n1p2        0      0    0    0     0  237132      0  59K    4     0       4     0     0      0   70
20:54:50 nvme1n1p2        0      0    0    0     0  269404      0  67K    4     0       4     0     0      0   80
20:54:51 nvme1n1p2        0      0    0    0     0  280064      0  70K    4     0       4     0     0      0   82
20:54:52 nvme1n1p2        0      0    0    0     0  251956      0  62K    4     0       4     0     0      0   75
20:54:53 nvme1n1p2        0      0    0    0     0  267232      0  66K    4     0       4     0     0      0   82
20:54:54 nvme1n1p2        0      0    0    0     0  263884      0  65K    4     0       4     0     0      0   83
20:54:55 nvme1n1p2        0      0    0    0     0  245224      0  61K    4     0       4     0     0      0   85
20:54:56 nvme1n1p2        0      0    0    0     0  262472      0  65K    4     0       4     0     0      0   86
20:54:57 nvme1n1p2        0      0    0    0     0  270272      0  67K    4     0       4     0     0      0   89
20:54:58 nvme1n1p2        0      0    0    0     0  264256      0  66K    4     0       4     0     0      0   90
20:54:59 nvme1n1p2        0      0    0    0     0  226052      0  56K    4     0       4     0     0      0   75
20:55:00 nvme1n1p2        0      0    0    0     0  224352      0  56K    4     0       4     0     0      0   76
20:55:01 nvme1n1p2        0      0    0    0     0  287760      0  71K    4     0       4     0     0      0   96
20:55:02 nvme1n1p2        0      0    0    0     0  220128      0  55K    4     0       4     0     0      0   73
Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx