Hi,

I have a strange issue: the OSDs on one specific server cause a
huge performance drop.

This is a brand new installation on 3 identical servers:
DELL R620 with PERC H710, bluestore DB and WAL on SSD, and
dedicated 10 Gbps private/public networks.

When I add those OSDs I see gaps in the benchmark output like
the ones below, and huge latency.

atop shows no clear culprit. Network and per-disk utilization
are very low, EXCEPT that DSK for the ceph-osd process is at
100% and stays there for the duration of the test (see below).

I am not sure why the ceph-osd process DSK stays at 100% while
the individual disks (sdb, sde, etc.) are about 1% busy.

Any help or instructions on how to troubleshoot this would be
appreciated.

(Apologies if the formatting is not preserved.)

CPU | sys 4% | user 1% | irq 1% | idle 794% | wait 0% | steal 0% | guest 0% | curf 2.20GHz | curscal ?% |
CPL | avg1 0.00 | avg5 0.00 | avg15 0.00 | csw 547/s | intr 832/s | numcpu 8 |
MEM | tot 62.9G | free 61.4G | cache 520.6M | dirty 0.0M | buff 7.5M | slab 98.9M | slrec 64.8M | shmem 8.8M | shrss 0.0M | shswp 0.0M | vmbal 0.0M | hptot 0.0M | hpuse 0.0M |
SWP | tot 6.0G | free 6.0G | vmcom 1.5G | vmlim 37.4G |
LVM | dm-0 | busy 1% | read 0/s | write 54/s | KiB/r 0 | KiB/w 455 | MBr/s 0.0 | MBw/s 24.0 | avq 3.69 | avio 0.14 ms |
DSK | sdb | busy 1% | read 0/s | write 102/s | KiB/r 0 | KiB/w 240 | MBr/s 0.0 | MBw/s 24.0 | avq 6.69 | avio 0.08 ms |
DSK | sda | busy 0% | read 0/s | write 12/s | KiB/r 0 | KiB/w 4 | MBr/s 0.0 | MBw/s 0.1 | avq 1.00 | avio 0.05 ms |
DSK | sde | busy 0% | read 0/s | write 0/s | KiB/r 0 | KiB/w 0 | MBr/s 0.0 | MBw/s 0.0 | avq 1.00 | avio 2.50 ms |
NET | transport | tcpi 718/s | tcpo 972/s | udpi 0/s | udpo 0/s | tcpao 0/s | tcppo 0/s | tcprs 21/s | tcpie 0/s | tcpor 0/s | udpnp 0/s | udpie 0/s |
NET | network | ipi 719/s | ipo 399/s | ipfrw 0/s | deliv 719/s | icmpi 0/s | icmpo 0/s |
NET | eth5 1% | pcki 2214/s | pcko 939/s | sp 10 Gbps | si 154 Mbps | so 52 Mbps | coll 0/s | mlti 0/s | erri 0/s | erro 0/s | drpi 0/s | drpo 0/s |
NET | eth4 0% | pcki 712/s | pcko 54/s | sp 10 Gbps | si 50 Mbps | so 90 Kbps | coll 0/s | mlti 0/s | erri 0/s | erro 0/s | drpi 0/s | drpo 0/s |

 PID  TID   RDDSK   WRDSK  WCANCL   DSK  CMD       1/21
2067    -    0K/s  0.0G/s    0K/s  100%  ceph-osd

2018-04-05 10:55:24.316549 min lat: 0.0203278 max lat: 10.7501 avg lat: 0.496822
  sec  Cur ops  started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
   40       16     1096      1080   107.988         0            -    0.496822
   41       16     1096      1080   105.354         0            -    0.496822
   42       16     1096      1080   102.846         0            -    0.496822
   43       16     1096      1080   100.454         0            -    0.496822
   44       16     1205      1189   108.079   48.4444    0.0430396    0.588127
   45       16     1234      1218   108.255       116    0.0318717    0.575485
   46       16     1234      1218   105.901         0            -    0.575485
   47       16     1234      1218   103.648         0            -    0.575485
   48       16     1234      1218   101.489         0            -    0.575485
   49       16     1261      1245   101.622        27     0.157469    0.604268
   50       16     1335      1319   105.508       296     0.191907    0.604862
   51       16     1418      1402   109.949       332    0.0367004    0.573429
   52       16     1437      1421   109.296        76     0.031818    0.566289
   53       16     1481      1465   110.554       176    0.0405567    0.564885
   54       16     1516      1500   111.099       140    0.0272873    0.552698
   55       16     1516      1500   109.079         0            -    0.552698
   56       16     1516      1500   107.131         0            -    0.552698
   57       16     1516      1500   105.252         0            -    0.552698
   58       16     1555      1539   106.127        39      0.15675    0.601747
Total time run: 58.971664
Total reads made: 1565
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 106.153
Average IOPS: 26
Stddev IOPS: 33
Max IOPS: 121
Min IOPS: 0
Average Latency(s): 0.600788
Max latency(s): 10.7501
Min latency(s): 0.019135
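
To make the gaps easier to see, here is a rough Python sketch that groups the seconds with "cur MB/s" of 0 into stall windows and cross-checks the reported bandwidth against the totals. The sample rows are copied from the per-second bench lines above (seconds 40 to 58 only, so the first window may have started earlier). The helper names `parse_rows` and `stall_windows` are made up for illustration, and the 8-column layout is assumed from the output as pasted.

```python
# Per-second lines copied from the bench output above (seconds 40-58 only).
BENCH = """\
40 16 1096 1080 107.988 0 - 0.496822
41 16 1096 1080 105.354 0 - 0.496822
42 16 1096 1080 102.846 0 - 0.496822
43 16 1096 1080 100.454 0 - 0.496822
44 16 1205 1189 108.079 48.4444 0.0430396 0.588127
45 16 1234 1218 108.255 116 0.0318717 0.575485
46 16 1234 1218 105.901 0 - 0.575485
47 16 1234 1218 103.648 0 - 0.575485
48 16 1234 1218 101.489 0 - 0.575485
49 16 1261 1245 101.622 27 0.157469 0.604268
50 16 1335 1319 105.508 296 0.191907 0.604862
51 16 1418 1402 109.949 332 0.0367004 0.573429
52 16 1437 1421 109.296 76 0.031818 0.566289
53 16 1481 1465 110.554 176 0.0405567 0.564885
54 16 1516 1500 111.099 140 0.0272873 0.552698
55 16 1516 1500 109.079 0 - 0.552698
56 16 1516 1500 107.131 0 - 0.552698
57 16 1516 1500 105.252 0 - 0.552698
58 16 1555 1539 106.127 39 0.15675 0.601747
"""


def parse_rows(text):
    """Return (second, finished_ops, cur_mb_per_s) for each per-second line."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: sec, cur ops, started, finished, avg MB/s,
        # cur MB/s, last lat, avg lat. Skip anything else.
        if len(fields) != 8 or not fields[0].isdigit():
            continue
        rows.append((int(fields[0]), int(fields[3]), float(fields[5])))
    return rows


def stall_windows(rows):
    """Group consecutive seconds with cur MB/s == 0 into (first, last) pairs."""
    windows = []
    start = last = None
    for sec, _finished, cur in rows:
        if cur == 0:
            if start is None:
                start = sec
            last = sec
        elif start is not None:
            windows.append((start, last))
            start = None
    if start is not None:
        windows.append((start, last))
    return windows


rows = parse_rows(BENCH)
windows = stall_windows(rows)
stalled = sum(last - first + 1 for first, last in windows)

# Cross-check of the summary: reads * object size / run time, in MiB/s.
bandwidth = 1565 * 4194304 / 58.971664 / 2**20

print("stall windows (sec):", windows)
print("stalled seconds:", stalled, "of", len(rows))
print("bandwidth from totals: %.3f MB/s" % bandwidth)
```

On the sampled rows this reports three windows of 3 to 4 seconds with no completed I/O, and the bandwidth recomputed from the totals matches the summary value.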

megacli -LDGetProp -cache -Lall -a0
Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone, Direct, Write Cache OK if bad BBU
Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
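
For comparing the cache settings between virtual disks (or between the three servers), a small Python sketch like the following could flag any VD whose policy differs from the most common one. The text is copied from the megacli output above with the wrapped lines joined. The names `cache_policies` and `outliers` are made up for illustration, and the regular expression only assumes the line shape seen here. Which VD maps to which sdX device or OSD is not something this output shows.

```python
import re
from collections import Counter

# megacli -LDGetProp -cache -Lall -a0 output from above, wrapped lines joined.
MEGACLI = """\
Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone, Direct, Write Cache OK if bad BBU
Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
"""

# Assumed line shape: "Adapter A-VD N(target id: T): Cache Policy:p1, p2, ..."
LINE = re.compile(r"Adapter (\d+)-VD (\d+)\(target id: \d+\): Cache Policy:(.*)")


def cache_policies(text):
    """Map VD number -> tuple of policy settings for each matching line."""
    result = {}
    for match in LINE.finditer(text):
        settings = tuple(s.strip() for s in match.group(3).split(","))
        result[int(match.group(2))] = settings
    return result


def outliers(policies):
    """Return the VDs whose settings differ from the most common combination."""
    most_common, _count = Counter(policies.values()).most_common(1)[0]
    return {vd: p for vd, p in policies.items() if p != most_common}


policies = cache_policies(MEGACLI)
different = outliers(policies)

for vd, settings in sorted(different.items()):
    print("VD %d differs: %s" % (vd, ", ".join(settings)))
```

On this server it singles out VD 0 (WriteThrough, ReadAheadNone, Direct), while VD 1 to 3 share the same WriteBack settings.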