On Fri, 18 Mar 2011, huang jun wrote:
> hi,
> my ceph cluster is:
> 1 mon,1 mds, 6 osds
> use a client to write files
> but after two days,the client can not write anymore , dmesg show:
> [152312.784043] libceph: tid 221531 timed out on osd2, will reset osd
> [152322.800025] libceph: tid 221534 timed out on osd1, will reset osd
> [152362.864029] libceph: tid 221553 timed out on osd3, will reset osd
> [152362.864115] libceph: tid 221556 timed out on osd0, will reset osd
> [152362.864175] libceph: tid 221558 timed out on osd4, will reset osd
> [152362.864236] libceph: tid 221568 timed out on osd5, will reset osd
> [152372.880024] libceph: tid 221531 timed out on osd2, will reset osd
> [152432.976030] libceph: tid 221531 timed out on osd2, will reset osd
> [152493.072035] libceph: tid 221531 timed out on osd2, will reset osd
> [152553.168039] libceph: tid 221531 timed out on osd2, will reset osd
> [152613.264027] libceph: tid 221531 timed out on osd2, will reset osd
> [152673.360028] libceph: tid 221531 timed out on osd2, will reset osd
> [152733.456028] libceph: tid 221531 timed out on osd2, will reset osd
> [152793.552026] libceph: tid 221531 timed out on osd2, will reset osd
> [152853.648025] libceph: tid 221531 timed out on osd2, will reset osd
> [152913.744029] libceph: tid 221531 timed out on osd2, will reset osd
> [152973.840026] libceph: tid 221531 timed out on osd2, will reset osd
> [153033.936026] libceph: tid 221531 timed out on osd2, will reset osd
>
> and on osd2:
> dmesg show :
>
> [140056.772753] btrfs: truncated 1 orphans
> [140108.340423] btrfs: truncated 1 orphans
> [141681.918175] btrfs: truncated 1 orphans
> [148394.437973] btrfs: truncated 1 orphans
> [152007.353121] btrfs: truncated 1 orphans
> [152338.400197] btrfs: truncated 1 orphans
> [152880.944055] INFO: task btrfs-transacti:3046 blocked for more than 120 seconds.
> [152880.944341] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [152880.944664] btrfs-transac D ffff88007e0996c8 0 3046 2 0x00000000
> [152880.944677] ffff88007e28d6f0 0000000000000046 0000000000000002 0000000000013500
> [152880.944688] ffff88006f5b3fd8 ffff88006f5b3fd8 ffff88007e099430 0000000000013500
> [152880.944699] 0000000000013500 0000000000013500 ffff88007e099430 0000000000000286
> [152880.944710] Call Trace:
> [152880.944727] [<ffffffff8114c646>] ? wait_for_commit+0x8f/0xd5
> [152880.944738] [<ffffffff810536e2>] ? autoremove_wake_function+0x0/0x2e
> [152880.944748] [<ffffffff8114d4cf>] ? btrfs_commit_transaction+0xff/0x5ec
> [152880.944759] [<ffffffff8130ecb4>] ? schedule_timeout+0x202/0x222
> [152880.944769] [<ffffffff810536e2>] ? autoremove_wake_function+0x0/0x2e
> [152880.944779] [<ffffffff8114928d>] ? transaction_kthread+0x158/0x20c
> [152880.944789] [<ffffffff81149135>] ? transaction_kthread+0x0/0x20c
> [152880.944798] [<ffffffff81053299>] ? kthread+0x79/0x81
> [152880.944808] [<ffffffff81003824>] ? kernel_thread_helper+0x4/0x10
> [152880.944818] [<ffffffff81053220>] ? kthread+0x0/0x81
> [152880.944827] [<ffffffff81003820>] ? kernel_thread_helper+0x0/0x10
> [152880.944837] INFO: task cosd:3137 blocked for more than 120 seconds.
> [152880.945157] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [152880.945513] cosd D ffff88006f5b5038 0 3137 1 0x00000000
> [152880.945523] ffff88007ebd34b0 0000000000000086 0000000000000000 0000000000013500
> [152880.945531] ffff88007e2dffd8 ffff88007e2dffd8 ffff88006f5b4da0 0000000000013500
> [152880.945540] 0000000000013500 0000000000013500 ffff88006f5b4da0 ffffffff810a5ba3
>
> everything seems ok after we out osd2,client works fluently.
> does the problem relate to btrfs ?

Yes. When the OSD gets hung up the client writes stall. We added a timeout mechanism that should force the cosd daemon to fail when the underlying file system isn't responsive, though... which version are you running on the server side?

sage