On Mon, 11 Apr 2011, Jim Schutt wrote:
> > > > I guess the other thing that would help to confirm this is to just halve
> > > > the number of OSDs on your machines in a test and see if the problem
> > > > goes away.
> > > I was going to try this first, exactly because it seems like
> > > a definitive test.
> > >
> > > > > If my analysis above is correct, do you think anything
> > > > > can be gained by running the heartbeat and heartbeat
> > > > > dispatcher threads as SCHED_RR threads? Since tick() runs
> > > > > heartbeat_check(), that would also need to be SCHED_RR,
> > > > > or the heartbeats could arrive on time, but not checked
> > > > > until it was too late.
> >
> > Thanks for the ideas. However, I doubt that making the OSD::tick()
> > thread SCHED_RR would really work.
> >
> > The OSD::tick() code is taking locks all over the place. Since a bunch
> > of other threads besides the tick thread can be holding those locks,
> > this would soon result in priority inversion. Not to mention,
> > heartbeat_messenger has its own thread(s) which actually perform the
> > work of sending the heartbeat messages.
>
> Yes, I think I understand.

We could set the priority for those threads as well, but I'm not sure that
really addresses the problem: we may end up with a situation where cosd is
responding to heartbeats but not doing useful work. At some point you
have to consider highly degraded service a failure.

Let's see if we can fix it without adjusting priorities first!

sage