16 osds: 11 up, 16 in

Is there anything unusual in dmesg on the host that carries osd.5?
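
Something like the following is what I'd check first (a rough sketch only; it assumes the default Ceph log and admin-socket paths, and the exact kernel messages to look for will vary with your hardware):

    # Kernel messages, filtered for the usual disk/controller trouble signs
    dmesg | egrep -i 'error|fail|blocked|hung|reset|timeout'

    # The OSD's own log often shows the last thing it was doing
    # before it stopped reporting pg stats
    tail -n 200 /var/log/ceph/ceph-osd.5.log

    # If the daemon is still running but pegged at 100% CPU, the admin
    # socket can show what it is currently working on
    ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight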


On Wednesday, May 7, 2014 at 23:09, Craig Lewis wrote:

> I already have osd_max_backfills = 1 and osd_recovery_op_priority = 1.
> 
> osd_recovery_max_active is at the default of 15, so I'll give that a try... Some OSDs timed out during the injectargs, so I added it to ceph.conf and restarted them all.
> 
> I was running RadosGW-Agent, but it's down now. I disabled scrub and deep-scrub as well, so all the disk I/O is dedicated to recovery now.
> 
> 15 minutes after the restart:
> 2014-05-07 13:03:19.249179 mon.0 [INF] osd.5 marked down after no pg stats for 901.601323seconds
> 
> One of the OSDs (osd.5) didn't complete the peering process. It's as if the OSD locked up immediately after the restart, and it certainly behaves that way: as soon as osd.5 started peering, it went to exactly 100% CPU, and other OSDs started complaining that it wasn't responding to subops.
> 
> 
> 
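
For reference, the throttling described above boils down to roughly the following (a sketch, not a definitive recipe: option spellings vary slightly between releases, and anything set via injectargs is lost when a daemon restarts, which is why it also goes into ceph.conf):

    # Throttle recovery/backfill at runtime (does not survive a restart)
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

    # Persist the same values in /etc/ceph/ceph.conf
    [osd]
        osd max backfills = 1
        osd recovery max active = 1
        osd recovery op priority = 1

    # Keep scrubbing from competing with recovery I/O
    ceph osd set noscrub
    ceph osd set nodeep-scrub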
