Here is a sysrq-t trace. I'm running 4 OSDs on the server. The one that
is causing problems has pid 31956.

Thanks,
Christian

2011/10/9 Sage Weil <sage@xxxxxxxxxxxx>:
> On Sun, 9 Oct 2011, Martin Mailand wrote:
>> Hi,
>> I am using v3.1-rc9, so the fix is in there. Maybe I can nail it down a bit
>> more specifically.
>
> You might try sysrq-t or -w to see what the spinning CPUs are doing.
>
> Thanks!
> sage
>
>
>>
>> Best Regards,
>> martin
>>
>> Sage Weil wrote:
>> > Hi Christian,
>> >
>> > On Sat, 8 Oct 2011, Christian Brunner wrote:
>> > > Hi,
>> > >
>> > > I've upgraded ceph from 0.32 to 0.36 yesterday. Now I have a totally
>> > > screwed ceph cluster. :(
>> > >
>> > > What bugs me most is the fact that OSDs become unresponsive
>> > > frequently. The process is eating a lot of cpu and I can see the
>> >
>> > What version of btrfs are you running? This sounds a bit like the bug fixed
>> > by this patch:
>> >
>> > http://www.spinics.net/lists/linux-btrfs/msg12627.html
>> >
>> > (That was just merged into mainline this week.)
>> >
>> > > following messages in the log:
>> > >
>> > > Oct 8 22:30:05 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
>> > > is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
>> > > Oct 8 22:30:10 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
>> > > is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
>> > > Oct 8 22:30:15 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
>> > > is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
>> > > Oct 8 22:30:20 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
>> > > is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
>> > > Oct 8 22:30:25 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
>> > > is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
>> > > Oct 8 22:30:30 os00 osd.000[31688]: 7fe0f3b9c700 heartbeat_map
>> > > is_healthy 'OSD::disk_tp thread 0x7fe0e527e700' had timed out after 60
>> > >
>> > > Do you have any idea what to do about that?
>> >
>> > Those messages just mean that a thread in the disk threadpool (which is
>> > doing all the writes to btrfs) is blocked/stopped.
>> >
>> > sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
Attachment: sysrq-t.txt.gz
Description: GNU Zip compressed data