On Wed, Sep 24, 2014 at 12:12 PM, Micha Krause <micha at krausam.de> wrote:
> Hi,
>
> I was able to get a dmesg output from the CentOS machine with kernel 3.16:
>
> kworker/3:2:9521 blocked for more than 120 seconds.
> Not tainted 3.16.2-1.el6.elrepo.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/3:2     D 0000000000000003     0  9521      2 0x00000080
> Workqueue: events handle_timeout [libceph]
> ffff8801228cfcd8 0000000000000046 0000000300000000 ffff8801228cc010
> 0000000000014400 0000000000014400 ffff8800ba01c250 ffff880234ed3070
> 0000000000000000 ffff8800baf237c8 ffff8800baf237cc ffff8800ba01c250
> Call Trace:
> [<ffffffff81647da9>] schedule+0x29/0x70
> [<ffffffff81647f0e>] schedule_preempt_disabled+0xe/0x10
> [<ffffffff8164987b>] __mutex_lock_slowpath+0xdb/0x1d0
> [<ffffffff81649993>] mutex_lock+0x23/0x40
> [<ffffffffa0348c73>] handle_timeout+0x63/0x1c0 [libceph]
> [<ffffffff8108d60c>] process_one_work+0x17c/0x420
> [<ffffffff8108e7d3>] worker_thread+0x123/0x420
> [<ffffffff8108e6b0>] ? maybe_create_worker+0x180/0x180
> [<ffffffff810943be>] kthread+0xce/0xf0
> [<ffffffff810942f0>] ? kthread_freezable_should_stop+0x70/0x70
> [<ffffffff8164b5bc>] ret_from_fork+0x7c/0xb0
> [<ffffffff810942f0>] ? kthread_freezable_should_stop+0x70/0x70
> INFO: task kworker/3:1:62 blocked for more than 120 seconds.
> Not tainted 3.16.2-1.el6.elrepo.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/3:1     D 0000000000000003     0    62      2 0x00000000
> Workqueue: events handle_osds_timeout [libceph]
> ffff880037907ce8 0000000000000046 0000000000000000 ffff880037904010
> 0000000000014400 0000000000014400 ffff880232389130 ffff880234ed3070
> ffffffff8101d833 ffff8800baf237c8 ffff8800baf237cc ffff880232389130
> Call Trace:
> [<ffffffff8101d833>] ? native_sched_clock+0x33/0xd0
> [<ffffffff81647da9>] schedule+0x29/0x70
> [<ffffffff81647f0e>] schedule_preempt_disabled+0xe/0x10
> [<ffffffff8164987b>] __mutex_lock_slowpath+0xdb/0x1d0
> [<ffffffff810afbbf>] ? put_prev_entity+0x2f/0x400
> [<ffffffff81649993>] mutex_lock+0x23/0x40
> [<ffffffffa0347003>] handle_osds_timeout+0x53/0x120 [libceph]
> [<ffffffff8108d60c>] process_one_work+0x17c/0x420
> [<ffffffff8108e7d3>] worker_thread+0x123/0x420
> [<ffffffff8108e6b0>] ? maybe_create_worker+0x180/0x180
> [<ffffffff810943be>] kthread+0xce/0xf0
> [<ffffffff810942f0>] ? kthread_freezable_should_stop+0x70/0x70
> [<ffffffff8164b5bc>] ret_from_fork+0x7c/0xb0
> [<ffffffff810942f0>] ? kthread_freezable_should_stop+0x70/0x70
> INFO: task kworker/u8:0:9486 blocked for more than 120 seconds.
> Not tainted 3.16.2-1.el6.elrepo.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u8:0    D 0000000000000002     0  9486      2 0x00000080
> Workqueue: writeback bdi_writeback_workfn (flush-253:7)
> ffff8802337cf368 0000000000000046 00000000ae5d42c1 ffff8802337cc010
> 0000000000014400 0000000000014400 ffff880232554fb0 ffff8800ba4be210
> ffff8802337cc010 ffff8800ba5579b8 ffff880232fa0250 ffff880232554fb0
> Call Trace:
> [<ffffffff81647da9>] schedule+0x29/0x70
> [<ffffffff81647f0e>] schedule_preempt_disabled+0xe/0x10
> [<ffffffff81649962>] __mutex_lock_slowpath+0x1c2/0x1d0
> [<ffffffff81649993>] mutex_lock+0x23/0x40
> [<ffffffffa033fc2d>] ceph_con_send+0x4d/0x150 [libceph]
> [<ffffffffa0348bc4>] __send_queued+0x134/0x180 [libceph]
> [<ffffffffa0349e7b>] __ceph_osdc_start_request+0x5b/0xb0 [libceph]
> [<ffffffffa0349f21>] ceph_osdc_start_request+0x51/0x80 [libceph]
> [<ffffffffa037b2a0>] rbd_img_obj_request_submit+0xb0/0x110 [rbd]
> [<ffffffffa037b349>] rbd_img_request_submit+0x49/0x60 [rbd]
> [<ffffffffa037bcd8>] rbd_request_fn+0x248/0x2b0 [rbd]
> [<ffffffff812b22e7>] __blk_run_queue+0x37/0x50
> [<ffffffff812b296e>] queue_unplugged+0x4e/0xb0
> [<ffffffff812b2b2e>] blk_flush_plug_list+0x15e/0x200
> [<ffffffff81647e65>] io_schedule+0x75/0xd0
> [<ffffffff812b3f87>] get_request+0x167/0x340
> [<ffffffff810b6220>] ? bit_waitqueue+0xe0/0xe0
> [<ffffffff812ae78b>] ? elv_merge+0xeb/0xf0
> [<ffffffff812b4228>] blk_queue_bio+0xc8/0x340
> [<ffffffff812b30f0>] generic_make_request+0xc0/0x100
> [<ffffffff812b31b0>] submit_bio+0x80/0x170
> [<ffffffff812abdf1>] ? bio_alloc_bioset+0xa1/0x1e0
> [<ffffffff811ff4a6>] _submit_bh+0x146/0x220
> [<ffffffff811ff590>] submit_bh+0x10/0x20
> [<ffffffff81202ed3>] __block_write_full_page.clone.0+0x1a3/0x340
> [<ffffffff81203790>] ? I_BDEV+0x10/0x10
> [<ffffffff81203790>] ? I_BDEV+0x10/0x10
> [<ffffffff81203246>] block_write_full_page+0xc6/0x100
> [<ffffffff81204848>] blkdev_writepage+0x18/0x20
> [<ffffffff81163be7>] __writepage+0x17/0x50
> [<ffffffff81164fe4>] write_cache_pages+0x244/0x510
> [<ffffffff81163bd0>] ? set_page_dirty+0x60/0x60
> [<ffffffff81165301>] generic_writepages+0x51/0x80
> [<ffffffff81165350>] do_writepages+0x20/0x40
> [<ffffffff811f6309>] __writeback_single_inode+0x49/0x230
> [<ffffffff810b665f>] ? wake_up_bit+0x2f/0x40
> [<ffffffff811f7149>] writeback_sb_inodes+0x279/0x390
> [<ffffffff811d03d5>] ? put_super+0x25/0x40
> [<ffffffff811f72fe>] __writeback_inodes_wb+0x9e/0xd0
> [<ffffffff811f752b>] wb_writeback+0x1fb/0x2c0
> [<ffffffff811f76f0>] wb_do_writeback+0x100/0x1f0
> [<ffffffff811f7a60>] bdi_writeback_workfn+0x70/0x210
> [<ffffffff8108d60c>] process_one_work+0x17c/0x420
> [<ffffffff8108e7d3>] worker_thread+0x123/0x420
> [<ffffffff8108e6b0>] ? maybe_create_worker+0x180/0x180
> [<ffffffff810943be>] kthread+0xce/0xf0
> [<ffffffff810942f0>] ? kthread_freezable_should_stop+0x70/0x70
> [<ffffffff8164b5bc>] ret_from_fork+0x7c/0xb0
> [<ffffffff810942f0>] ? kthread_freezable_should_stop+0x70/0x70

Sorry, this is a known rbd deadlock in 3.15/3.16.  3.16.3 has a fix.

I'd be very interested to see something similar for 3.13.

Thanks,

                Ilya
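For anyone trying to capture a comparable report from a 3.13 machine, a rough
sketch of the usual steps (assumes the sysrq interface is enabled on that box;
the hung_task sysctl is the same one mentioned in the dmesg output above):

    # confirm which kernel is actually running (3.15/3.16 are affected,
    # 3.16.3 carries the fix)
    uname -r

    # hung-task reports like the ones above fire after 120s by default;
    # writing 0 here disables them, a larger value changes the threshold
    cat /proc/sys/kernel/hung_task_timeout_secs

    # dump backtraces of all blocked (D-state) tasks on demand via sysrq,
    # then pull them out of the kernel log
    echo w > /proc/sysrq-trigger
    dmesg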