Re: cosd multi-second stalls cause "wrongly marked me down"

Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx> · Wed, 9 Mar 2011 11:37:04 -0800



On Wednesday, March 9, 2011 at 10:36 AM, Jim Schutt wrote:
> Here's another example with more debugging. The
> PG count during this interval is:
> 
> 2011-03-09 10:35:58.306942 pg v379: 25344 pgs: 25344 active+clean; 12119 MB data, 12025 MB used, 44579 GB / 44787 GB avail
> 2011-03-09 10:36:42.177728 pg v462: 25344 pgs: 25344 active+clean; 46375 MB data, 72672 MB used, 44520 GB / 44787 GB avail
> 
> Check out the interval 10:36:23.473356 -- 10:36:27.922262
> 
> It looks to me like a heartbeat message submission is 
> waiting on something?

Yes, it sure does. The only thing that should block between those output messages is getting the messenger lock, which *ought* to be fast. Either there are a lot of threads trying to send messages and the heartbeat thread is just getting unlucky, or there's a mistake in where and how the messenger takes its locks (which is certainly possible, but in a brief audit it looks correct).
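
To confirm that the time really is going to the lock wait, one could time the acquisition directly. What follows is only a minimal sketch, not Ceph code: it uses a plain std::mutex as a stand-in for the messenger lock, and timed_lock and simulate_contended_send are hypothetical names made up for illustration.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical helper (not part of Ceph): acquire `lock` and return how
// long the acquisition blocked, in seconds. The caller must unlock.
double timed_lock(std::mutex& lock) {
    auto start = std::chrono::steady_clock::now();
    lock.lock();
    auto waited = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double>(waited).count();
}

// Simulate the suspected scenario: a busy sender holds the lock for
// roughly `hold` while a "heartbeat" thread tries to take it. Returns
// the wait seen by the heartbeat thread, in seconds.
double simulate_contended_send(std::chrono::milliseconds hold) {
    std::mutex messenger_lock;           // stand-in for the messenger lock
    std::atomic<bool> started{false};
    double waited = 0.0;

    messenger_lock.lock();               // this thread plays the busy sender
    std::thread heartbeat([&] {
        started.store(true);             // signal that we are about to block
        waited = timed_lock(messenger_lock);
        messenger_lock.unlock();
    });

    // Wait until the heartbeat thread is running, then keep holding the lock.
    while (!started.load()) {
        std::this_thread::yield();
    }
    std::this_thread::sleep_for(hold);
    messenger_lock.unlock();

    heartbeat.join();                    // join makes `waited` safe to read
    return waited;
}
```

Wrapping the real lock acquisition in something like timed_lock and logging any wait above, say, a second would show whether the stall is in the lock itself or somewhere else on the send path.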
-Greg

