On Mon, Nov 3, 2014 at 11:41 AM, Chad Seys <cwseys@xxxxxxxxxxxxxxxx> wrote:
> On Monday, November 03, 2014 13:22:47 you wrote:
>> Okay, assuming this is semi-predictable, can you start up one of the
>> OSDs that is going to fail with "debug osd = 20", "debug filestore =
>> 20", and "debug ms = 1" in the config file and then put the OSD log
>> somewhere accessible after it's crashed?
>
> Alas, I have not yet noticed a pattern.  Only thing I think is true is that
> they go down when I first make CRUSH changes.  Then after restarting, they run
> without going down again.
> All the OSDs are running at the moment.

Oh, interesting. What CRUSH changes exactly are you making that are
spawning errors?

> What I've been doing is marking OUT the OSDs on which a request is blocked,
> letting the PGs recover, (drain the OSD of PGs completely), then remove and
> readd the OSD.
>
> So far OSDs treated this way no longer have blocked requests.
>
> Also, seems as though that slowly decreases the number of incomplete and
> down+incomplete PGs.
>
>>
>> Can you also verify that all of your monitors are running firefly, and
>> then issue the command "ceph scrub" and report the output?
>
> Sure, should I wait until the current rebalancing is finished?

I don't think it should matter, although I confess I'm not sure how
much monitor load the scrubbing adds. (It's a monitor check; doesn't
hit the OSDs at all.)
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com