Re: OSD::disk_tp timeout

Christian Brunner <chb@xxxxxx> · Sun, 9 Oct 2011 08:02:27 +0200

Here is a sysrq-t trace.

I'm running 4 OSDs on the server. The one that is causing problems has
pid 31956.
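
For anyone who wants to capture a comparable trace: a minimal sketch, assuming a Linux host and root privileges, of requesting the dump by writing a SysRq key to /proc/sysrq-trigger. The function name and the trigger_path parameter are hypothetical and exist only for illustration; the dump itself lands in the kernel log (e.g. dmesg), not in the return value.

```python
def request_task_dump(key="t", trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel to dump task state into its log.

    key "t" dumps all tasks; key "w" dumps only blocked
    (uninterruptible) tasks. trigger_path is a parameter only so the
    function can be exercised against an ordinary file; on a real
    system the default path requires root.
    """
    if key not in ("t", "w"):
        raise ValueError("only 't' and 'w' are supported in this sketch")
    with open(trigger_path, "w") as trigger:
        trigger.write(key)
```

Calling request_task_dump("w") keeps the output shorter when only the blocked threads are of interest.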

Thanks,
Christian

2011/10/9 Sage Weil <sage@xxxxxxxxxxxx>:
> On Sun, 9 Oct 2011, Martin Mailand wrote:
>> Hi,
>> I am using v3.1-rc9, so the fix is in there. Maybe I can narrow it down a bit
>> further.
>
> You might try sysrq-t or -w to see what the spinning CPUs are doing.
>
> Thanks!
> sage
>
>
>>
>> Best Regards,
>>  martin
>>
>> Sage Weil wrote:
>> > Hi Christian,
>> >
>> > On Sat, 8 Oct 2011, Christian Brunner wrote:
>> > > Hi,
>> > >
>> > > I upgraded ceph from 0.32 to 0.36 yesterday. Now I have a totally
>> > > screwed ceph cluster. :(
>> > >
>> > > What bugs me most is that the OSDs frequently become
>> > > unresponsive. The process is eating a lot of CPU and I can see the
>> >
>> > What version of btrfs are you running?  This sounds a bit like the bug fixed
>> > by this patch:
>> >
>> > http://www.spinics.net/lists/linux-btrfs/msg12627.html
>> >
>> > (That was just merged into mainline this week.)
>> >
>> > > following messages in the log:
>> > >
>> > > Oct  8 22:30:05 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
>> > > is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
>> > > Oct  8 22:30:10 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
>> > > is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
>> > > Oct  8 22:30:15 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
>> > > is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
>> > > Oct  8 22:30:20 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
>> > > is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
>> > > Oct  8 22:30:25 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
>> > > is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
>> > > Oct  8 22:30:30 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
>> > > is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
>> > >
>> > > Do you have any idea what to do about that?
>> >
>> > Those messages just mean that a thread in the disk threadpool (which is
>> > doing all the writes to btrfs) is blocked/stopped.
>> >
>> > sage
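
To see how many disk threadpool threads are affected, the heartbeat_map messages quoted above can be tallied per thread. This is a minimal sketch that assumes the log lines have exactly the shape shown in this thread; the regular expression and the function name count_timeouts are illustrative and not part of ceph.

```python
import re
from collections import Counter

# Matches e.g.: heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fe0e527e700'
# had timed out after 60
# \s+ also matches newlines, so lines wrapped by a mail client still match.
TIMEOUT_RE = re.compile(
    r"heartbeat_map\s+is_healthy\s+"
    r"'(?P<pool>\S+) thread (?P<thread>0x[0-9a-f]+)'"
    r"\s+had timed out after (?P<seconds>\d+)"
)


def count_timeouts(log_text):
    """Return a Counter keyed by (threadpool name, thread id)."""
    return Counter(
        (match.group("pool"), match.group("thread"))
        for match in TIMEOUT_RE.finditer(log_text)
    )
```

A single key with a growing count, as in the excerpt above, points at one stuck thread rather than the whole pool.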
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
Attachment: sysrq-t.txt.gz
Description: GNU Zip compressed data