On 5/7/14 15:33, Dimitri Maziuk wrote:
> On 05/07/2014 04:11 PM, Craig Lewis wrote:
>> On 5/7/14 13:40, Sergey Malinin wrote:
>>> Check dmesg and SMART data on both nodes. This behaviour is similar to
>>> a failing hdd.
>>>
>> It does sound like a failing disk... but there's nothing in dmesg, and
>> smartmontools hasn't emailed me about a failing disk. The same thing is
>> happening to more than 50% of my OSDs, on both nodes.
>
> Check 'iostat -dmx 5 5' (or some other numbers) -- if you see 100%+ disk
> utilization, that could be the dying one.

About an hour after I applied osd_recovery_max_active=1, things settled
down. Looking at the graphs, it looks like most of the OSDs crashed one
more time, then started working correctly.

Because the recovery parameters are set very low, there's only a single
backfill running. `iostat -dmx 5 5` did report 100% util on the OSD that
is backfilling, but I expected that. Once backfilling moves on to a new
OSD, the 100% util follows the backfill operation.

There's still a lot of recovery left to finish; hopefully things stay
stable until it completes. If so, I'll add osd_recovery_max_active=1 to
ceph.conf.
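
For the archives, here's roughly what that looks like. This is a sketch
assuming a 2014-era Ceph CLI (injectargs for runtime changes), and
osd_max_backfills is my guess at the other "very low recovery parameter";
it isn't named above, so treat it as illustrative:

    # Throttle recovery at runtime (takes effect immediately, but does not
    # persist across daemon restarts):
    ceph tell osd.* injectargs '--osd-recovery-max-active 1'

    # Assumed related knob: limits concurrent backfills per OSD, which
    # would explain seeing only a single backfill running at a time:
    ceph tell osd.* injectargs '--osd-max-backfills 1'

    # To persist across restarts, add to the [osd] section of ceph.conf:
    #
    #   [osd]
    #       osd recovery max active = 1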
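
And the disk-health check suggested earlier in the thread, roughly (the
device names are examples; substitute your actual OSD data disks):

    # Overall SMART health verdict per disk:
    for dev in /dev/sd{b..g}; do
        echo "== $dev =="
        smartctl -H "$dev"
    done

    # Per-device utilization: a disk pinned near 100 %util that is NOT the
    # one currently backfilling would be the suspect:
    iostat -dmx 5 5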