On 9/23/20 12:18 PM, Mark Nelson wrote:
On 9/23/20 10:58 AM, vitalif@xxxxxxxxxx wrote:
I have no idea how you get 66k write iops with one OSD )
I've just repeated a test by creating a test pool on one NVMe OSD
with 8 PGs (all pinned to the same OSD with pg-upmap). Then I ran 4x
fio randwrite q128 over 4 RBD images. I got 17k iops.
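For reference, a single-OSD test along those lines looks roughly like this (a sketch only; the pool/image names are illustrative and the pg-upmap pinning is repeated once per PG):

ceph osd pool create testpool 8 8
# pin every PG of the pool to one OSD, e.g. osd.0 (repeat per PG id)
ceph osd pg-upmap 1.0 0
# create one of the 4 test images (100GB; rbd sizes are in MB by default)
rbd create testpool/img0 --size 102400
fio --ioengine=rbd --clientname=admin --pool=testpool --rbdname=img0 \
    --rw=randwrite --bs=4k --iodepth=128 --numjobs=1 --time_based \
    --runtime=300 --name=img0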
OK, in fact that's not the worst result for Ceph, but the problem is that
I only get 30k write iops when benchmarking 4 RBD images spread over
all OSDs _in_the_same_cluster_. And there are 14 of them.
The key parts are the single OSD, the tuning, the fast HW, the massive
aggregate io depth, and the fact that it's a fresh OSD with only this
dataset sitting on it. Like I said, you can't get that when you go to
a multi-node setup with replication. The same hardware is only
achieving about 19K IOPS per OSD before replication in an 8 node, 64
NVMe, 3x rep configuration. Feel free to look at the data from the
run yourself though (2.5MB tgz file):
https://drive.google.com/file/d/1SuFoJtxAauKTBjgrV9XB00_M3Yr9LwW1/view?usp=sharing
In the results directory you'll see the 4 separate copies of fio doing
4k random writes, each to its own 128GB dataset on its own RBD volume
via the librbd engine with iodepth=128:
[nhm@o10 id-f7e4b44f]$ grep "randwrite," output.*.localhost
output.0.localhost:cbt-librbd/o10.front.sepia.ceph.com-0-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.1.localhost:cbt-librbd/o10.front.sepia.ceph.com-1-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.2.localhost:cbt-librbd/o10.front.sepia.ceph.com-2-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
output.3.localhost:cbt-librbd/o10.front.sepia.ceph.com-3-0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128
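Each of those corresponds to a job roughly like the following (a sketch rather than the exact CBT-generated job file; the pool and client names are assumptions):

[o10.front.sepia.ceph.com-0-0]
ioengine=rbd
clientname=admin
pool=cbt-librbd
rbdname=o10.front.sepia.ceph.com-0
rw=randwrite
bs=4k
iodepth=128
size=131072M
time_based
runtime=300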
[nhm@o10 id-f7e4b44f]$ grep rbdname output.*.localhost
output.0.localhost: "rbdname" : "o10.front.sepia.ceph.com-0",
output.1.localhost: "rbdname" : "o10.front.sepia.ceph.com-1",
output.2.localhost: "rbdname" : "o10.front.sepia.ceph.com-2",
output.3.localhost: "rbdname" : "o10.front.sepia.ceph.com-3",
[nhm@o10 id-f7e4b44f]$ grep size output.*.localhost
output.0.localhost: "size" : "131072M",
output.1.localhost: "size" : "131072M",
output.2.localhost: "size" : "131072M",
output.3.localhost: "size" : "131072M",
[nhm@o10 id-f7e4b44f]$ grep IOPS output.*.localhost
output.0.localhost: write: IOPS=15.5k, BW=60.5MiB/s (63.4MB/s)(17.7GiB/300003msec); 0 zone resets
output.1.localhost: write: IOPS=15.7k, BW=61.2MiB/s (64.2MB/s)(17.9GiB/300005msec); 0 zone resets
output.2.localhost: write: IOPS=15.6k, BW=60.9MiB/s (63.9MB/s)(17.8GiB/300005msec); 0 zone resets
output.3.localhost: write: IOPS=19.1k, BW=74.5MiB/s (78.2MB/s)(21.8GiB/300006msec); 0 zone resets
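That works out to roughly 15.5K + 15.7K + 15.6K + 19.1K ≈ 65.9K aggregate write IOPS across the four clients, i.e. the ~66K figure discussed above.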
The timestamps are virtually identical:
[nhm@o10 id-f7e4b44f]$ grep timestamp_ms output.*.localhost
output.0.localhost: "timestamp_ms" : 1570222502962,
output.1.localhost: "timestamp_ms" : 1570222502965,
output.2.localhost: "timestamp_ms" : 1570222502965,
output.3.localhost: "timestamp_ms" : 1570222502959,
as is the job runtime:
output.0.localhost: "job_runtime" : 300002,
output.1.localhost: "job_runtime" : 300004,
output.2.localhost: "job_runtime" : 300004,
output.3.localhost: "job_runtime" : 300005,
We can perform some sanity checks by looking at collectl data from the
run:
[nhm@o10 id-f7e4b44f]$ collectl -sD -oT -p collectl.localhost/o10-20191004-204957.raw.gz | grep -E "nvme0n1p1|#" | tail -n 21
# DISK STATISTICS (/sec)
#                    <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
20:54:45 nvme0n1p1        0      0    0    0     0  227780  55001 1944  117     1     117     7     1      0   39
20:54:46 nvme0n1p1        0      0    0    0     0  485056 117319 3945  123     4     122    38     4      0   45
20:54:47 nvme0n1p1        0      0    0    0     0  409144  98947 3339  123     7     122    52     7      0   42
20:54:48 nvme0n1p1        0      0    0    0     0  384940  93056 3179  121     2     121    14     2      0   51
20:54:49 nvme0n1p1        0      0    0    0     0  499152 120724 4064  123     4     122    28     4      0   55
20:54:50 nvme0n1p1        0      0    0    0     0  388560  93931 3209  121     4     121    26     4      0   52
20:54:51 nvme0n1p1        0      0    0    0     0  316304  76415 2661  119     5     118    26     5      0   51
20:54:52 nvme0n1p1        0      0    0    0     0  418876 107305 3443  122     5     121    30     5      0   52
20:54:53 nvme0n1p1        0      0    0    0     0  367420  82728 3098  119     4     118    26     4      0   50
20:54:54 nvme0n1p1        0      0    0    0     0  293444  70912 2449  120     5     119    33     5      0   41
20:54:55 nvme0n1p1        0      0    0    0     0  300680  72687 2483  121     6     121    39     6      0   38
20:54:56 nvme0n1p1        0      0    0    0     0  274404  66311 2290  120     2     119    10     2      0   40
20:54:57 nvme0n1p1        0      0    0    0     0  244220  58996 2059  119     5     118    28     5      0   39
20:54:58 nvme0n1p1        0      0    0    0     0  242116  58491 2038  119     6     118    31     6      0   37
20:54:59 nvme0n1p1        0      0    0    0     0  436208 105512 3540  123     5     123    40     5      0   44
20:55:00 nvme0n1p1        0      0    0    0     0  427420 103459 3474  123     7     123    54     7      0   42
20:55:01 nvme0n1p1        0      0    0    0     0  152428  36677 1352  113     0     112     1     0      0   36
20:55:02 nvme0n1p1        0      0    0    0     0  448740 108542 3643  123     5     123    45     5      0   44
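If collectl isn't handy, the same write-merge and queue-depth behavior can be cross-checked with iostat's extended statistics (a rough sketch, assuming the sysstat package is installed; column names vary a bit between versions):

# wrqm/s shows merged write requests, aqu-sz/avgqu-sz the average queue depth
iostat -dxm -p nvme0n1 1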
So in this case the block layer in the kernel was able to play some
tricks and merge a bunch of IOs, helping us get that nice juicy
benchmark number (though these P4510 NVMe drives may have been able to
handle more randomly scattered, non-merged 4K writes relatively well
too). We also have a number of IOs waiting in the device queue,
meaning we are pushing the NVMe at least moderately hard. Next we
look at OSD process data over time:
[nhm@o10 id-f7e4b44f]$ collectl -sZ -oT -p collectl.localhost/o10-20191004-204957.raw.gz | awk 'NR==5 || /ceph-osd/' | head -n 10
#Time      PID  User  PR  PPID THRD S  VSZ  RSS CP  SysT   UsrT  Pct  AccuTime  RKB  WKB MajF MinF Command
20:50:08  7964  root  20     1   71 S   7G   6G 84 10.13  49.34  594  09:03.66    0    0    0  75K /usr/local/bin/ceph-osd
20:50:18  7964  root  20     1   71 S   7G   6G 84 15.26  84.28  995  10:43.21    0    0    0  42K /usr/local/bin/ceph-osd
20:50:28  7964  root  20     1   71 S   7G   6G 84 15.87  89.10 1049  12:28.18    0    0    0  59K /usr/local/bin/ceph-osd
20:50:38  7964  root  20     1   71 S   7G   6G 84 15.46  94.56 1100  14:18.20    0    0    0  66K /usr/local/bin/ceph-osd
20:50:48  7964  root  20     1   71 S   7G   6G 84 15.19  94.82 1100  16:08.21    0    0    0  76K /usr/local/bin/ceph-osd
20:50:58  7964  root  20     1   71 S   7G   6G 84 14.64  92.22 1068  17:55.07    0    0    0  52K /usr/local/bin/ceph-osd
20:51:08  7964  root  20     1   71 S   7G   6G 84 14.91  95.36 1102  19:45.34    0    0    0  55K /usr/local/bin/ceph-osd
20:51:18  7964  root  20     1   71 S   7G   6G 84 15.48 100.92 1164  21:41.74    0    0    0  64K /usr/local/bin/ceph-osd
20:51:28  7964  root  20     1   71 S   7G   6G 84 14.91  90.63 1055  23:27.28    0    0    0  53K /usr/local/bin/ceph-osd
The OSD process is consuming a little under 8GB, which is reasonably on
target. CPU usage (the Pct column) stabilizes at roughly 10-12 cores.
The important point is that the OSD/bluestore machinery is perfectly
happy consuming a ton of cores and driving ~66K IOPS worth of work at
the underlying device, assuming the block layer and the device itself
can handle it. It used to be that the kv sync and onode cache trimming
threads were much bigger bottlenecks, but we've improved both over the
last couple of years. Having said that, we're getting pretty close to
the current limits of bluestore here. Also, once you start involving
the network and replication, we introduce enough extra latency and
contention in our stack that the OSD can no longer maintain these
kinds of numbers even if the device itself and bluestore can handle
it. That's where seastore, crimson, and future msgr work come in
imho. Finally, we can use cbt's OSD log parser to look at rocksdb
compaction statistics. This also probably explains at least some of
the large IO workload we see in collectl, since we are spending nearly
a third of the overall OSD uptime in compaction (and given that the test
only ran for 300s of that uptime, rocksdb may be spending the majority
of the actual test portion of the OSD uptime performing universal
compaction):
[nhm@o10 nvme_nvme_bs]$ ~/src/ceph/cbt/tools/ceph_rocksdb_log_parser.py ./osd.0.log | head -n 13
Compaction Statistics ./osd.0.log
Total OSD Log Duration (seconds) 605.512
Number of Compaction Events 112
Avg Compaction Time (seconds) 1.68255873214
Total Compaction Time (seconds) 188.446578
Avg Output Size: (MB) 380.801597553
Total Output Size: (MB) 42649.7789259
Total Input Records 243648735
Total Output Records 174897784
Avg Output Throughput (MB/s) 250.561577229
Avg Input Records/second 1242038.3797
Avg Output Records/second 1059967.66123
Avg Output/Input Ratio 0.862494228971
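If you just want a live view rather than parsing the OSD log, the OSD admin socket exposes rocksdb and bluestore perf counters as well (a sketch; section and counter names vary by release, and jq is only there for readability):

ceph daemon osd.0 perf dump | jq '{rocksdb: .rocksdb, bluestore: .bluestore}'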
Mark
Minor correction: nvme0n1p1 was a DB/WAL partition, while nvme1n1p2 is
the block storage device where the 4k writes are happening, unmerged.
That makes way more sense, since those merged writes are all just the
WAL and big SST writes. Sorry, it's been a while since I looked at
these numbers. I.e., here we are using a separate NVMe drive for the
DB/WAL. In reality I don't recall it actually helping that much vs
just putting everything on a single P4510, but in any event I wanted
to update for clarity.
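For context, an OSD laid out this way is typically created along these lines (a sketch; the device paths are illustrative):

ceph-volume lvm create --bluestore --data /dev/nvme1n1p2 --block.db /dev/nvme0n1p1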
Actual block partition writes:
[nhm@o10 id-f7e4b44f]$ collectl -sD -oT -p collectl.localhost/o10-20191004-204957.raw.gz | grep -E "nvme1n1p2|#" | tail -n 21
# DISK STATISTICS (/sec)
#                    <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
20:54:45 nvme1n1p2        0      0    0    0     0  261804      0  65K    4     0       3     0     0      0   89
20:54:46 nvme1n1p2        0      0    0    0     0  206916      0  51K    4     0       4     0     0      0   67
20:54:47 nvme1n1p2        0      0    0    0     0  218888      0  54K    4     0       4     0     0      0   75
20:54:48 nvme1n1p2        0      0    0    0     0  261032      0  65K    4     0       4     0     0      0   79
20:54:49 nvme1n1p2        0      0    0    0     0  237132      0  59K    4     0       4     0     0      0   70
20:54:50 nvme1n1p2        0      0    0    0     0  269404      0  67K    4     0       4     0     0      0   80
20:54:51 nvme1n1p2        0      0    0    0     0  280064      0  70K    4     0       4     0     0      0   82
20:54:52 nvme1n1p2        0      0    0    0     0  251956      0  62K    4     0       4     0     0      0   75
20:54:53 nvme1n1p2        0      0    0    0     0  267232      0  66K    4     0       4     0     0      0   82
20:54:54 nvme1n1p2        0      0    0    0     0  263884      0  65K    4     0       4     0     0      0   83
20:54:55 nvme1n1p2        0      0    0    0     0  245224      0  61K    4     0       4     0     0      0   85
20:54:56 nvme1n1p2        0      0    0    0     0  262472      0  65K    4     0       4     0     0      0   86
20:54:57 nvme1n1p2        0      0    0    0     0  270272      0  67K    4     0       4     0     0      0   89
20:54:58 nvme1n1p2        0      0    0    0     0  264256      0  66K    4     0       4     0     0      0   90
20:54:59 nvme1n1p2        0      0    0    0     0  226052      0  56K    4     0       4     0     0      0   75
20:55:00 nvme1n1p2        0      0    0    0     0  224352      0  56K    4     0       4     0     0      0   76
20:55:01 nvme1n1p2        0      0    0    0     0  287760      0  71K    4     0       4     0     0      0   96
20:55:02 nvme1n1p2        0      0    0    0     0  220128      0  55K    4     0       4     0     0      0   73
Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx