First and foremost, have you checked your disk controller? The most important thing to look at is the cache battery. Any time I have a single node acting up, the controller is Suspect #1.
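As a rough illustration of what I mean, the sketch below parses the `megacli -LDGetProp -cache` lines quoted further down and lists the virtual disks whose write-back cache depends on a healthy battery ("No Write Cache if bad BBU"). On those VDs the controller drops to write-through if the BBU is bad or relearning, and that output alone will not show it. The function names and the regular expression are my own and only match the line format shown in this thread.

```python
import re

# Sample copied from the `megacli -LDGetProp -cache -Lall -a0` output quoted below.
SAMPLE = """\
Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone, Direct, Write Cache OK if bad BBU
Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
"""

# Assumption: every line follows the exact layout seen in this thread.
LINE_RE = re.compile(r"Adapter (\d+)-VD (\d+)\(target id: \d+\): Cache Policy:(.+)")


def parse_cache_policies(text):
    """Map (adapter, virtual disk) to the list of cache policy flags."""
    policies = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            adapter, vd, flags = match.groups()
            policies[(int(adapter), int(vd))] = [f.strip() for f in flags.split(",")]
    return policies


def bbu_dependent(policies):
    """Return the VDs that lose their write-back cache when the BBU is bad."""
    return sorted(
        key
        for key, flags in policies.items()
        if "WriteBack" in flags and "No Write Cache if bad BBU" in flags
    )


if __name__ == "__main__":
    for adapter, vd in bbu_dependent(parse_cache_policies(SAMPLE)):
        print("adapter %d VD %d: write-back depends on a healthy BBU" % (adapter, vd))
```

Running this on each of the three servers only tells you which VDs are exposed. The battery state itself has to be read from the controller. If I remember the syntax right that is `megacli -AdpBbuCmd -GetBbuStatus -a0`, so compare the result on the slow node against the two healthy ones.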
On Thu, Apr 5, 2018 at 11:23 AM Steven Vacaroaia <stef97@xxxxxxxxx> wrote:
Hi,

I have a strange issue - OSDs from a specific server are introducing huge performance issues.

This is a brand new installation on 3 identical servers - DELL R620 with PERC H710, bluestore DB and WAL on SSD, 10 Gb dedicated private/public networks.

When I add the OSD I see gaps like below and huge latency.

atop provides no clear culprit EXCEPT very low network and specific disk utilization BUT 100% DSK for the ceph-osd process, which stays like that (100%) for the duration of the test (see below).

Not sure why the ceph-osd process DSK stays at 100% while all the specific DSK (for sdb, sde, etc.) are 1% busy?

Any help / instructions for how to troubleshoot this will be appreciated.

(apologies if the format is not being kept)

CPU | sys 4% | user 1% | | irq 1% | | idle 794% | wait 0% | | | steal 0% | guest 0% | curf 2.20GHz | | curscal ?% |
CPL | avg1 0.00 | | avg5 0.00 | avg15 0.00 | | | | csw 547/s | | intr 832/s | | | numcpu 8 | |
MEM | tot 62.9G | free 61.4G | cache 520.6M | dirty 0.0M | buff 7.5M | slab 98.9M | slrec 64.8M | shmem 8.8M | shrss 0.0M | shswp 0.0M | vmbal 0.0M | | hptot 0.0M | hpuse 0.0M |
SWP | tot 6.0G | free 6.0G | | | | | | | | | | vmcom 1.5G | | vmlim 37.4G |
LVM | dm-0 | busy 1% | | read 0/s | write 54/s | | KiB/r 0 | KiB/w 455 | MBr/s 0.0 | | MBw/s 24.0 | avq 3.69 | | avio 0.14 ms |
DSK | sdb | busy 1% | | read 0/s | write 102/s | | KiB/r 0 | KiB/w 240 | MBr/s 0.0 | | MBw/s 24.0 | avq 6.69 | | avio 0.08 ms |
DSK | sda | busy 0% | | read 0/s | write 12/s | | KiB/r 0 | KiB/w 4 | MBr/s 0.0 | | MBw/s 0.1 | avq 1.00 | | avio 0.05 ms |
DSK | sde | busy 0% | | read 0/s | write 0/s | | KiB/r 0 | KiB/w 0 | MBr/s 0.0 | | MBw/s 0.0 | avq 1.00 | | avio 2.50 ms |
NET | transport | tcpi 718/s | tcpo 972/s | udpi 0/s | | udpo 0/s | tcpao 0/s | tcppo 0/s | tcprs 21/s | tcpie 0/s | tcpor 0/s | | udpnp 0/s | udpie 0/s |
NET | network | ipi 719/s | | ipo 399/s | ipfrw 0/s | | deliv 719/s | | | | | icmpi 0/s | | icmpo 0/s |
NET | eth5 1% | pcki 2214/s | pcko 939/s | | sp 10 Gbps | si 154 Mbps | so 52 Mbps | | coll 0/s | mlti 0/s | erri 0/s | erro 0/s | drpi 0/s | drpo 0/s |
NET | eth4 0% | pcki 712/s | pcko 54/s | | sp 10 Gbps | si 50 Mbps | so 90 Kbps | | coll 0/s | mlti 0/s | erri 0/s | erro 0/s | drpi 0/s | drpo 0/s |

PID TID RDDSK WRDSK WCANCL DSK CMD 1/212067 - 0K/s 0.0G/s 0K/s 100% ceph-osd

2018-04-05 10:55:24.316549 min lat: 0.0203278 max lat: 10.7501 avg lat: 0.496822
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
40 16 1096 1080 107.988 0 - 0.496822
41 16 1096 1080 105.354 0 - 0.496822
42 16 1096 1080 102.846 0 - 0.496822
43 16 1096 1080 100.454 0 - 0.496822
44 16 1205 1189 108.079 48.4444 0.0430396 0.588127
45 16 1234 1218 108.255 116 0.0318717 0.575485
46 16 1234 1218 105.901 0 - 0.575485
47 16 1234 1218 103.648 0 - 0.575485
48 16 1234 1218 101.489 0 - 0.575485
49 16 1261 1245 101.622 27 0.157469 0.604268
50 16 1335 1319 105.508 296 0.191907 0.604862
51 16 1418 1402 109.949 332 0.0367004 0.573429
52 16 1437 1421 109.296 76 0.031818 0.566289
53 16 1481 1465 110.554 176 0.0405567 0.564885
54 16 1516 1500 111.099 140 0.0272873 0.552698
55 16 1516 1500 109.079 0 - 0.552698
56 16 1516 1500 107.131 0 - 0.552698
57 16 1516 1500 105.252 0 - 0.552698
58 16 1555 1539 106.127 39 0.15675 0.601747
Total time run: 58.971664
Total reads made: 1565
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 106.153
Average IOPS: 26
Stddev IOPS: 33
Max IOPS: 121
Min IOPS: 0
Average Latency(s): 0.600788
Max latency(s): 10.7501
Min latency(s): 0.019135

megacli -LDGetProp -cache -Lall -a0
Adapter 0-VD 0(target id: 0): Cache Policy:WriteThrough, ReadAheadNone, Direct, Write Cache OK if bad BBU
Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
Adapter 0-VD 2(target id: 2): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU
Adapter 0-VD 3(target id: 3): Cache Policy:WriteBack, ReadAdaptive, Cached, No Write Cache if bad BBU

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com