On 5/7/14 15:33, Dimitri Maziuk wrote:
> On 05/07/2014 04:11 PM, Craig Lewis wrote:
>> On 5/7/14 13:40, Sergey Malinin wrote:
>>> Check dmesg and SMART data on both nodes. This behaviour is similar to
>>> a failing hdd.
>>>
>> It does sound like a failing disk... but there's nothing in dmesg, and
>> smartmontools hasn't emailed me about a failing disk. The same thing is
>> happening to more than 50% of my OSDs, on both nodes.
>
> Check 'iostat -dmx 5 5' (or some other numbers) -- if you see 100%+ disk
> utilization, that could be the dying one.

About an hour after I applied osd_recovery_max_active=1, things settled
down. Looking at the graphs, it looks like most of the OSDs crashed one
more time, then started working correctly.

Because the recovery parameters are set very low, there's only a single
backfill running. `iostat -dmx 5 5` did report 100% util on the OSD that
is backfilling, but I expected that. Once backfilling moves on to a new
OSD, the 100% util follows the backfill operation.

There's still a lot of recovery left to finish; hopefully things stay
stable until it completes. If so, I'll add osd_recovery_max_active=1 to
ceph.conf.
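
For the archives, here's roughly what that looks like. This is a sketch
assuming a 2014-era Ceph CLI (injectargs for runtime changes), and
osd_max_backfills is my guess at the other "very low recovery parameter";
it isn't named above, so treat it as illustrative:

    # Throttle recovery at runtime (takes effect immediately, but does not
    # persist across daemon restarts):
    ceph tell osd.* injectargs '--osd-recovery-max-active 1'

    # Assumed related knob: limits concurrent backfills per OSD, which
    # would explain seeing only a single backfill running at a time:
    ceph tell osd.* injectargs '--osd-max-backfills 1'

    # To persist across restarts, add to the [osd] section of ceph.conf:
    #
    #   [osd]
    #       osd recovery max active = 1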
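
And the disk-health check suggested earlier in the thread, roughly (the
device names are examples; substitute your actual OSD data disks):

    # Overall SMART health verdict per disk:
    for dev in /dev/sd{b..g}; do
        echo "== $dev =="
        smartctl -H "$dev"
    done

    # Per-device utilization: a disk pinned near 100 %util that is NOT the
    # one currently backfilling would be the suspect:
    iostat -dmx 5 5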