Re: cluster busy, cause heartbeat exceptional, cluster becomes more busy

Sage Weil <sweil@xxxxxxxxxx> · Wed, 25 Nov 2015 05:16:13 -0800 (PST)

On Wed, 25 Nov 2015, Chenxiaowei wrote:
> We met another serious problem, as follows:
>
> During backfill, an rbd client sends ops to the cluster and slow requests
> come up. So when an OSD heartbeat comes in, the check
> cct->get_heartbeat_map()->is_healthy() returns false.
>
> The other OSDs therefore do not receive a heartbeat reply and report a
> failure to the monitor. The monitor marks the OSD down, which leads to more
> OSD peering and an even busier cluster. So here comes the question:
>
> Why is the OSD heartbeat check logic combined with the heartbeat map (which
> checks the other thread pools and so on)?
>
> I am really confused about this logic. I look forward to your reply.

The idea is simply that if the OSD is not healthy (e.g., stuck op thread) 
it should not respond to heartbeats and tell other OSDs that it is 
healthy.  It should get marked down.  After it recovers 
(wait_for_healthy), then it can rejoin the cluster.  (Or, more likely,
the thread is completely stuck and it will suicide.)

I think the issue is that the backfill + client load was enough to make
is_healthy() fail.  That really shouldn't be happening.  As long as the
threads are making progress they won't fail their internal heartbeat
checks; that only happens if they get completely stuck.  I suspect
something else broke?
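
To make the distinction concrete, here is a toy Python model of the idea. It is a sketch, not the actual OSD code: the class, the handle_ping helper, the worker name, and the 15 s / 150 s timeouts are all assumptions chosen for illustration. A worker that keeps resetting its timeout stays healthy however busy it is; only a worker that stops making progress trips the check.

```python
import time


class SuicideTimeout(Exception):
    """Raised when a worker has been stuck past its suicide deadline."""


class HeartbeatMap:
    """Toy per-thread internal heartbeat map (illustrative, not Ceph's code)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        # worker name -> (grace deadline, suicide deadline)
        self._deadlines = {}

    def reset_timeout(self, worker, grace, suicide_grace):
        # Called by a worker each time it makes progress (e.g. per op).
        now = self._clock()
        self._deadlines[worker] = (now + grace, now + suicide_grace)

    def clear_timeout(self, worker):
        # Called when a worker goes idle; idle workers are never unhealthy.
        self._deadlines.pop(worker, None)

    def is_healthy(self):
        now = self._clock()
        healthy = True
        for worker, (deadline, suicide_deadline) in self._deadlines.items():
            if now > suicide_deadline:
                raise SuicideTimeout(worker)
            if now > deadline:
                healthy = False
        return healthy


def handle_ping(hbmap):
    # Hypothetical ping handler: an unhealthy daemon drops the ping, so its
    # peers eventually report it as failed to the monitor.
    return "PING_REPLY" if hbmap.is_healthy() else None
```

With a controllable clock, a worker that last reset its timeout at t=10 with a 15 s grace still answers pings at t=24, stops answering at t=26, and raises SuicideTimeout once the 150 s suicide grace has also passed.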

sage