Re: OSD::disk_tp timeout

Sage Weil <sage@xxxxxxxxxxxx> · Sat, 8 Oct 2011 14:28:45 -0700 (PDT)

Hi Christian,

On Sat, 8 Oct 2011, Christian Brunner wrote:
> Hi,
> 
> I've upgraded ceph from 0.32 to 0.36 yesterday. Now I have a totally
> screwed ceph cluster. :(
> 
> What bugs me most is the fact that OSDs become unresponsive
> frequently. The process is eating a lot of cpu and I can see the

What version of btrfs are you running?  This sounds a bit like the bug 
fixed by this patch:

http://www.spinics.net/lists/linux-btrfs/msg12627.html

(That was just merged into mainline this week.)

> following messages in the log:
> 
> Oct  8 22:30:05 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> Oct  8 22:30:10 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> Oct  8 22:30:15 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> Oct  8 22:30:20 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> Oct  8 22:30:25 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> Oct  8 22:30:30 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
> is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
> 
> Do you have any idea what to do about that?

Those messages just mean that a thread in the disk threadpool (which is 
doing all the writes to btrfs) is blocked/stopped.
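
To illustrate the idea, here is a toy sketch of a heartbeat check. It is 
not the actual Ceph code: the class and method names are made up for 
illustration, and the grace period of 60 is simply taken from the "timed 
out after 60" in the log above. Each worker records when it last made 
progress, and a separate health check reports any worker that has gone 
quiet for longer than the grace period.

```python
import time


class HeartbeatMap:
    """Toy watchdog (hypothetical): workers record their last progress time."""

    def __init__(self, grace_seconds=60):
        self.grace = grace_seconds
        self.last_seen = {}  # worker name -> timestamp of last progress

    def touch(self, name, now=None):
        # A worker calls this each time it finishes a unit of work.
        self.last_seen[name] = time.monotonic() if now is None else now

    def unhealthy(self, now=None):
        # Return the workers that have not checked in within the grace period.
        now = time.monotonic() if now is None else now
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.grace
        )


hb = HeartbeatMap(grace_seconds=60)
hb.touch("disk_tp thread A", now=0)
hb.touch("disk_tp thread B", now=0)
hb.touch("disk_tp thread B", now=50)  # B keeps making progress
print(hb.unhealthy(now=65))           # A is reported, B is not
```

In this sketch a worker stuck in a single long write never calls touch() 
again, so every subsequent check reports it. That would match the same 
message repeating every few seconds for the same thread.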

sage