On 9/23/20 10:58 AM, vitalif@xxxxxxxxxx wrote:
> I have no idea how you get 66k write iops with one OSD )
> I've just repeated a test by creating a test pool on one NVMe OSD with 8 PGs (all pinned to the same OSD with pg-upmap). Then I ran 4x fio randwrite q128 over 4 RBD images. I got 17k iops.
> OK, in fact that's not the worst result for Ceph, but the problem is that I only get 30k write iops when benchmarking 4 RBD images spread over all OSDs _in_the_same_cluster_. And there are 14 of them.
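For anyone following along, the setup being described above is roughly the following sketch, assuming a 1x (unreplicated) pool; the pool/image names, pg IDs, and OSD targets here are placeholders, and the exact pg-upmap-items mappings depend on where the 8 PGs initially land:

  # single-OSD test pool with 8 PGs, size 1 (no replication)
  ceph osd pool create testpool 8 8
  ceph osd pool set testpool size 1
  ceph osd pool set testpool min_size 1

  # pin every PG to the same OSD, e.g. move PG 1.0 from osd.3 to osd.0
  # (repeat for each of the 8 PGs, using its current acting OSD as the source)
  ceph osd pg-upmap-items 1.0 3 0

  # one RBD image per fio client (size in MB)
  rbd create testpool/img0 --size 131072

  # 4K randwrite at iodepth=128; run one of these per image
  fio --name=img0 --ioengine=rbd --clientname=admin --pool=testpool \
      --rbdname=img0 --rw=randwrite --bs=4k --iodepth=128 \
      --time_based --runtime=300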
The key parts are the single OSD, the tuning, the fast HW, the massive
aggregate IO depth, and the fact that it's a fresh OSD with only this
dataset sitting on it. Like I said, you can't get that when you go to a
multi-node setup with replication. The same hardware only achieves
about 19K IOPS per OSD before replication in an 8-node, 64 NVMe, 3x
replication configuration. Feel free to look at the data from the run
yourself though (2.5MB tgz file):
https://drive.google.com/file/d/1SuFoJtxAauKTBjgrV9XB00_M3Yr9LwW1/view?usp=sharing
In the results directory you'll see four separate copies of fio, each
doing 4K random writes to its own 128GB dataset on its own RBD volume
via the librbd engine with iodepth=128:
[nhm@o10 id-f7e4b44f]$ grep "randwrite," output.*.localhost
output.0.localhost:cbt-librbd/o10.front.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.1.localhost:cbt-librbd/o10.front.sepia.ceph.com-1-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.2.localhost:cbt-librbd/o10.front.sepia.ceph.com-2-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.3.localhost:cbt-librbd/o10.front.sepia.ceph.com-3-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
[nhm@o10 id-f7e4b44f]$ grep rbdname output.*.localhost
output.0.localhost: "rbdname" : "o10.front.sepia.ceph.com-0",
output.1.localhost: "rbdname" : "o10.front.sepia.ceph.com-1",
output.2.localhost: "rbdname" : "o10.front.sepia.ceph.com-2",
output.3.localhost: "rbdname" : "o10.front.sepia.ceph.com-3",
[nhm@o10 id-f7e4b44f]$ grep size output.*.localhost
output.0.localhost: "size" : "131072M",
output.1.localhost: "size" : "131072M",
output.2.localhost: "size" : "131072M",
output.3.localhost: "size" : "131072M",
[nhm@o10 id-f7e4b44f]$ grep IOPS output.*.localhost
output.0.localhost: write: IOPS=15.5k, BW=60.5MiB/s (63.4MB/s)(17.7GiB/300003msec); 0 zone resets
output.1.localhost: write: IOPS=15.7k, BW=61.2MiB/s (64.2MB/s)(17.9GiB/300005msec); 0 zone resets
output.2.localhost: write: IOPS=15.6k, BW=60.9MiB/s (63.9MB/s)(17.8GiB/300005msec); 0 zone resets
output.3.localhost: write: IOPS=19.1k, BW=74.5MiB/s (78.2MB/s)(21.8GiB/300006msec); 0 zone resets
The timestamps are virtually identical:
[nhm@o10 id-f7e4b44f]$ grep timestamp_ms output.*.localhost
output.0.localhost: "timestamp_ms" : 1570222502962,
output.1.localhost: "timestamp_ms" : 1570222502965,
output.2.localhost: "timestamp_ms" : 1570222502965,
output.3.localhost: "timestamp_ms" : 1570222502959,
as is the job runtime:
output.0.localhost: "job_runtime" : 300002,
output.1.localhost: "job_runtime" : 300004,
output.2.localhost: "job_runtime" : 300004,
output.3.localhost: "job_runtime" : 300005,
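Since the four jobs start and stop at virtually the same time, you can simply sum their IOPS to get the aggregate, which is where the ~66K number comes from:

  $ echo "15.5 + 15.7 + 15.6 + 19.1" | bc
  65.9

i.e. roughly 66K client-side 4K random write IOPS against a single OSD.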
We can perform some sanity checks by looking at collectl data from the run:
[nhm@o10 id-f7e4b44f]$ collectl -sD -oT -p collectl.localhost/o10-20191004-204957.raw.gz | grep -E "nvme0n1p1|#" | tail -n 21
# DISK STATISTICS (/sec)
#                    <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
20:54:45 nvme0n1p1        0      0    0    0     0  227780  55001 1944  117     1     117     7     1      0   39
20:54:46 nvme0n1p1        0      0    0    0     0  485056 117319 3945  123     4     122    38     4      0   45
20:54:47 nvme0n1p1        0      0    0    0     0  409144  98947 3339  123     7     122    52     7      0   42
20:54:48 nvme0n1p1        0      0    0    0     0  384940  93056 3179  121     2     121    14     2      0   51
20:54:49 nvme0n1p1        0      0    0    0     0  499152 120724 4064  123     4     122    28     4      0   55
20:54:50 nvme0n1p1        0      0    0    0     0  388560  93931 3209  121     4     121    26     4      0   52
20:54:51 nvme0n1p1        0      0    0    0     0  316304  76415 2661  119     5     118    26     5      0   51
20:54:52 nvme0n1p1        0      0    0    0     0  418876 107305 3443  122     5     121    30     5      0   52
20:54:53 nvme0n1p1        0      0    0    0     0  367420  82728 3098  119     4     118    26     4      0   50
20:54:54 nvme0n1p1        0      0    0    0     0  293444  70912 2449  120     5     119    33     5      0   41
20:54:55 nvme0n1p1        0      0    0    0     0  300680  72687 2483  121     6     121    39     6      0   38
20:54:56 nvme0n1p1        0      0    0    0     0  274404  66311 2290  120     2     119    10     2      0   40
20:54:57 nvme0n1p1        0      0    0    0     0  244220  58996 2059  119     5     118    28     5      0   39
20:54:58 nvme0n1p1        0      0    0    0     0  242116  58491 2038  119     6     118    31     6      0   37
20:54:59 nvme0n1p1        0      0    0    0     0  436208 105512 3540  123     5     123    40     5      0   44
20:55:00 nvme0n1p1        0      0    0    0     0  427420 103459 3474  123     7     123    54     7      0   42
20:55:01 nvme0n1p1        0      0    0    0     0  152428  36677 1352  113     0     112     1     0      0   36
20:55:02 nvme0n1p1        0      0    0    0     0  448740 108542 3643  123     5     123    45     5      0   44
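As a rough gauge of how much merging is happening, the write Merged and IOs columns can be summed to get the request rate hitting the block layer before merging (a sketch; the field numbers assume the column layout shown above, and note this traffic includes WAL and compaction writes, not just the client 4K writes):

  collectl -sD -oT -p collectl.localhost/o10-20191004-204957.raw.gz \
      | grep nvme0n1p1 \
      | awk '{ print $1, ($9 + $10), "write reqs/s before merging" }'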
So in this case the block layer in the kernel was able to merge a bunch
of IOs, helping us get that nice juicy benchmark number (though these
P4510 NVMe drives may have been able to handle the more randomly
scattered, non-merged 4K writes relatively well too). We also have a
number of IOs waiting in the device queue (QLen), meaning we are pushing
the NVMe at least moderately hard. Next we look at OSD process data
over time:
[nhm@o10 id-f7e4b44f]$ collectl -sZ -oT -p collectl.localhost/o10-20191004-204957.raw.gz | awk 'NR==5 || /ceph-osd/' | head -n 10
#Time      PID  User     PR  PPID THRD S   VSZ   RSS CP  SysT   UsrT  Pct AccuTime  RKB  WKB MajF MinF Command
20:50:08  7964  root     20     1   71 S    7G    6G 84 10.13  49.34  594 09:03.66    0    0    0  75K /usr/local/bin/ceph-osd
20:50:18  7964  root     20     1   71 S    7G    6G 84 15.26  84.28  995 10:43.21    0    0    0  42K /usr/local/bin/ceph-osd
20:50:28  7964  root     20     1   71 S    7G    6G 84 15.87  89.10 1049 12:28.18    0    0    0  59K /usr/local/bin/ceph-osd
20:50:38  7964  root     20     1   71 S    7G    6G 84 15.46  94.56 1100 14:18.20    0    0    0  66K /usr/local/bin/ceph-osd
20:50:48  7964  root     20     1   71 S    7G    6G 84 15.19  94.82 1100 16:08.21    0    0    0  76K /usr/local/bin/ceph-osd
20:50:58  7964  root     20     1   71 S    7G    6G 84 14.64  92.22 1068 17:55.07    0    0    0  52K /usr/local/bin/ceph-osd
20:51:08  7964  root     20     1   71 S    7G    6G 84 14.91  95.36 1102 19:45.34    0    0    0  55K /usr/local/bin/ceph-osd
20:51:18  7964  root     20     1   71 S    7G    6G 84 15.48 100.92 1164 21:41.74    0    0    0  64K /usr/local/bin/ceph-osd
20:51:28  7964  root     20     1   71 S    7G    6G 84 14.91  90.63 1055 23:27.28    0    0    0  53K /usr/local/bin/ceph-osd
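The Pct column is total CPU percent for the process across all of its threads, so dividing by 100 gives an approximate core count over time (again a sketch; the field number assumes the column layout shown above):

  collectl -sZ -oT -p collectl.localhost/o10-20191004-204957.raw.gz \
      | awk '/ceph-osd/ { print $1, $13 / 100, "cores" }'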
The OSD process is consuming a little under 8GB of memory, which is
reasonably on target. The CPU used by the OSD stabilizes at roughly
10-12 cores. The important point is that the OSD/bluestore machinery is
perfectly happy consuming a ton of cores and driving ~66K IOPS worth of
work at the underlying device, assuming the block layer and the device
itself can handle it. It used to be that the kv sync and onode cache
trimming threads were much bigger bottlenecks, but we've improved both
over the last couple of years. Having said that, we're getting pretty
close to the current limits of bluestore here. Also, once you start
involving the network and replication, we introduce enough extra latency
and contention in our stack that the OSD can no longer maintain these
kinds of numbers even if the device itself and bluestore can handle it.
That's where seastore, crimson, and future msgr work come in IMHO.
Finally, we can use cbt's OSD log parser to look at rocksdb compaction
statistics. This also probably explains at least some of the large IO
workload we see in collectl, since we are spending nearly 1/3rd of the
overall OSD uptime in compaction (and given that the test only ran for
300s of that uptime, rocksdb may be spending the majority of the actual
test window performing universal compaction):
[nhm@o10 nvme_nvme_bs]$ ~/src/ceph/cbt/tools/ceph_rocksdb_log_parser.py ./osd.0.log | head -n 13
Compaction Statistics ./osd.0.log
Total OSD Log Duration (seconds) 605.512
Number of Compaction Events 112
Avg Compaction Time (seconds) 1.68255873214
Total Compaction Time (seconds) 188.446578
Avg Output Size: (MB) 380.801597553
Total Output Size: (MB) 42649.7789259
Total Input Records 243648735
Total Output Records 174897784
Avg Output Throughput (MB/s) 250.561577229
Avg Input Records/second 1242038.3797
Avg Output Records/second 1059967.66123
Avg Output/Input Ratio 0.862494228971
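The "nearly 1/3rd of the OSD uptime in compaction" figure above falls out of those numbers directly:

  $ echo "scale=3; 188.446578 / 605.512" | bc
  .311

i.e. about 31% of the 605s log window was spent compacting.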
Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx