On 5/7/14 15:33, Dimitri Maziuk wrote:
> On 05/07/2014 04:11 PM, Craig Lewis wrote:
>> On 5/7/14 13:40, Sergey Malinin wrote:
>>> Check dmesg and SMART data on both nodes. This behaviour is similar to
>>> failing hdd.
>>>
>> It does sound like a failing disk... but there's nothing in dmesg, and
>> smartmontools hasn't emailed me about a failing disk. The same thing is
>> happening to more than 50% of my OSDs, in both nodes.
>
> check 'iostat -dmx 5 5' (or some other numbers) -- if you see 100%+ disk
> utilization, that could be the dying one.

A new OSD, osd.10, has started doing this. I currently have all of the previously advised settings active (osd_max_backfills = 1, osd_recovery_op_priority = 1, osd_recovery_max_active = 1). I stopped the daemon and started watching iostat:

root@ceph1c:~# iostat sde -dmx 5
Device:   rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sde         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
sde         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
# I started the osd daemon during this next sample
sde         0.00     0.00     7.60    33.20     0.81     0.92    86.55     0.06     1.57     3.58     1.11     0.71     2.88
sde         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
sde         0.00     0.00     2.20   336.00     0.01     1.46     8.91     0.07     0.21    17.82     0.09     0.20     6.88
# During this next sample, the ceph-osd daemon started consuming exactly 100% CPU
sde         0.00     0.00     0.40     8.40     0.00     0.36    84.18     0.02     2.00    26.00     0.86     1.18     1.04
sde         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
sde         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00

Device:   rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sde         0.00     0.00     0.00    18.00     0.00     0.28    31.73     0.02     1.11     0.00     1.11     0.04     0.08
sde         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
sde         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
<snip repetitive rows>
sde         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
sde         0.00     0.00     1.20     0.00     0.08     0.00   132.00     0.02    20.67    20.67     0.00    20.67     2.48
sde         0.00     0.00     0.40     0.00     0.03     0.00   128.00     0.02    46.00    46.00     0.00    46.00     1.84
sde         0.00     0.00     0.20     0.00     0.01     0.00   128.00     0.01    44.00    44.00     0.00    44.00     0.88
sde         0.00     0.00     5.00    15.60     0.41     0.82   121.94     0.03     1.24     4.64     0.15     1.17     2.40
sde         0.00     0.00     0.00    27.40     0.00     0.17    12.44     0.49    17.96     0.00    17.96     0.53     1.44
# The suicide timer hits in this sample or the next, and the daemon restarts
sde         0.00     0.00   113.60   261.20     2.31     1.00    18.08     1.17     3.12    10.15     0.06     1.79    66.96

Device:   rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sde         0.00     0.00   176.20   134.60     3.15     1.31    29.40     1.79     5.77    10.12     0.08     3.16    98.16
sde         0.00     0.00   184.40     6.80     3.05     0.07    33.46     1.94    10.15    10.53     0.00     5.10    97.52
sde         0.00     0.00   202.20    28.80     3.60     0.26    34.26     2.06     8.92    10.18     0.06     4.09    94.40
sde         0.00     0.00   193.20    20.80     2.90     0.28    30.43     2.02     9.44    10.43     0.15     4.58    97.92
^C

During the first cycle, there was almost no data being read or written. During the second cycle, I see what looks like a normal recovery operation, but the daemon still hits 100% CPU and gets kicked out for being unresponsive. The third and fourth cycles (not shown) look like the first cycle.

So this is not a failing disk. 0% disk util with 100% CPU util means the code is stuck in some sort of fast loop that doesn't need external input. It could be some legit task that it's not able to complete before being killed, or it could be a hung lock.
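If it's the former, one cheap test would be raising the OSD's internal suicide timeouts so a long task can finish before the heartbeat code aborts the process. A sketch only -- I'm writing the option names from memory, so they should be verified against a live daemon before trusting them:

# Verify the option names first (admin socket path is the default layout;
# both the path and the names below are assumptions to double-check):
ceph --admin-daemon /var/run/ceph/ceph-osd.10.asok config show | grep suicide

# Then in /etc/ceph/ceph.conf, something like:
#   [osd]
#       osd op thread suicide timeout = 600
#       filestore op thread suicide timeout = 600
# and restart just the one daemon:
service ceph restart osd.10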
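Either way, a few stack samples from the spinning daemon should show what the loop is actually doing. A sketch of that (assumes perf, gdb, and the ceph debug symbols are installed on the node; the pid file path is the sysvinit default and may differ):

# find the daemon's PID (pid file path is an assumption -- or pick it out of ps)
PID=$(cat /var/run/ceph/osd.10.pid)

# live view of the hottest functions in the process
perf top -p $PID

# or take a few whole-process stack snapshots, spaced out
for i in 1 2 3; do
    gdb -p $PID -batch -ex 'thread apply all bt' > /tmp/osd.10.stacks.$i
    sleep 10
done

A thread that shows the same frames in every snapshot is the one eating the CPU.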
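There's also a way to take the peer-driven down-marking out of the picture while I watch, using the standard cluster flags:

ceph osd set noout     # a down OSD won't be marked out, so no re-balancing starts
ceph osd set nodown    # peers can't mark the flapping OSD down

# and to back out of the experiment later:
ceph osd unset nodown
ceph osd unset noout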
I'm going to try setting noout and nodown, and see if that helps. I'm trying to test whether it's some start-up operation (leveldb compaction or something) that can't complete before the other OSDs mark it down. I'll give that an hour to see what happens. If it's still flapping after that, I'll unset nodown and disable the daemon for the time being.

--
*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email clewis@centraldesktop.com

*Central Desktop. Work together in ways you never thought possible.*
Connect with us: Website <http://www.centraldesktop.com/> | Twitter <http://www.twitter.com/centraldesktop> | Facebook <http://www.facebook.com/CentralDesktop> | LinkedIn <http://www.linkedin.com/groups?gid=147417> | Blog <http://cdblog.centraldesktop.com/>